Open
Bug 261263
Opened 20 years ago
Updated 2 years ago
validate charset (esp. UTF-8) of text/plain downloads (first N bytes) before accepting as text/plain
Categories
(Firefox :: File Handling, defect)
NEW
People
(Reporter: darin.moz, Unassigned)
Details
Validate the charset of text/plain downloads (first N bytes) before accepting them as text/plain. This is derived from bug 261207, in which a binary file is incorrectly served as "text/plain; charset=UTF-8".
Comment 1•20 years ago
This can presumably only be done for a very limited set of character sets, most notably UTF-8 -- and mislabelling of UTF-8 data is most likely to mean it's really ISO-8859-1 or similar. What exactly were you suggesting we should do? (i.e. what are the steps to reproduce, and what change in behaviour would you consider to mean the bug was fixed?)
Comment 2•20 years ago
At the moment, we only do our type sniffing if the server delivers "text/plain" or "text/plain; charset=ISO-8859-1" data. Darin's suggestion is that for "text/plain; charset=FOO" data we check whether it looks like data in the FOO charset, and if it does not that we sniff for the content type. The steps to reproduce are to load a non-text file sent as "text/plain; charset=FOO" where FOO is not ISO-8859-1 by the web server and check whether Mozilla tries to render it as plaintext or whether we sniff it as application/octet-stream. The slope is a slippery one. ;)
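A minimal sketch of the decision described in this comment, written in Python purely for illustration. It is not Mozilla's sniffer: the function names, the `LEGACY_SNIFFED` set, and the choice to treat an unknown charset as "cannot validate" are all assumptions.

```python
from email.message import Message

# Charset labels that already trigger sniffing today, per this comment:
# no charset parameter at all, or ISO-8859-1. (Name is hypothetical.)
LEGACY_SNIFFED = {None, "iso-8859-1"}


def parse_content_type(header: str):
    """Split a Content-Type value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header
    return msg.get_content_type(), msg.get_content_charset()


def decodes_cleanly(data: bytes, charset: str) -> bool:
    """Return True if data is a valid byte sequence in charset."""
    try:
        data.decode(charset)
    except LookupError:
        # Unknown charset label: nothing to validate against (assumption).
        return True
    except UnicodeDecodeError:
        return False
    return True


def should_sniff(header: str, prefix: bytes) -> bool:
    """Decide whether to run content-type sniffing on a response."""
    media_type, charset = parse_content_type(header)
    if media_type != "text/plain":
        return False
    if charset in LEGACY_SNIFFED:
        return True  # current behaviour
    # Proposed: only distrust "charset=FOO" when the data is not valid FOO.
    return not decodes_cleanly(prefix, charset)
```

For example, `should_sniff("text/plain; charset=UTF-8", b"\xff\xd8\xff\xe0")` returns `True` because 0xFF can never appear in UTF-8, while the same header with ordinary UTF-8 text returns `False`.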
Comment 3•20 years ago
For most character sets, any byte-stream is valid. I would strongly recommend we don't go down this path...
Comment 4•20 years ago
(In reply to comment #3)
> For most character sets, any byte-stream is valid.
Which means that they won't be affected by anything we change here (would render as text/plain, as they do now). Or am I missing something?
> I would strongly recommend we don't go down this path...
Is that still the case given the previous statement I made in this comment? (Note that I don't feel particularly strongly about this, but I do think it's a pretty safe heuristic.)
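The point argued in comments 3 and 4 can be checked directly. The sketch below (Python, illustrative only; `decodes_strictly` and `ALL_BYTES` are hypothetical names) feeds every possible byte value to a strict decoder. ISO-8859-1 accepts all of them, so a validity check can never fire for it, whereas UTF-8 rejects the input.

```python
ALL_BYTES = bytes(range(256))  # every possible byte value, once each


def decodes_strictly(data: bytes, charset: str) -> bool:
    """Return True if data is a valid byte sequence in charset."""
    try:
        data.decode(charset, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(decodes_strictly(ALL_BYTES, "iso-8859-1"))  # True
print(decodes_strictly(ALL_BYTES, "utf-8"))       # False
```

This is why the heuristic only has teeth for encodings with structural constraints, UTF-8 being the obvious one.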
Comment 5•20 years ago
Are there really enough pages sent as UTF-8 that aren't text files to warrant this? (Does IE in XPSP2 do _any_ sniffing of text/plain content?) Special casing of certain character sets to do something that is explicitly the opposite of what the server said just seems like a really weird thing to do, especially given that we are effectively promoting content from the least dangerous content (text/plain) to something that will be processed by an application that is known to be rather buggy (Windows Media Player). Here's an idea, though: Why don't we display the content, but, if we think it might be binary data, display one of those cool "info bars" and say "This document appears to be an MPEG Video File. _Open_with_Media_Player_ [Close]" or similar? That would mean we did the right thing and were safe, but allowed the user to easily get to the content as if it was correctly labelled.
Comment 6•20 years ago
(In reply to comment #5)
> (Does IE in XPSP2 do _any_ sniffing of text/plain content?)
This will need to be answered by someone who has a Windows system somewhere nearby.
> especially given that we are effectively promoting content from the least
> dangerous content (text/plain) to something that will be processed by an
> application that is known to be rather buggy (Windows Media Player).
I agree that this is an issue.
> Here's an idea, though: Why don't we display the content
This on its own is enough to hang Mozilla or the OS or crash Mozilla in many cases (buggy fonts that give garbage for certain codepoints, bugs in xft, issues with font servers, trying to look up bogus codepoints in every single font of the system for millions of characters (since binary files can easily get to be megabytes in size), etc, etc).
We could _not_ display the content and show the little info bar, but that's what we already do (except we show the helper app dialog). What we do need to do is add a "view as text" option on said helper app dialog.
Comment 7•20 years ago
> This on its own is enough to hang Mozilla or the OS or crash Mozilla in many
> cases (buggy fonts that give garbage for certain codepoints,
We shouldn't crash if we get bogus font data. That should be fixed.
> bugs in xft, issues with font servers,
Same as above, of course. Frankly I'd rather take my chances with xft bugs than Windows Media Player bugs.
> trying to look up bogus codepoints in every single font of the system for
> millions of characters (since binary files can easily get to be megabytes in
> size), etc, etc).
Why are we looking up codepoints for characters that aren't being painted? If there are really millions of bogus codepoints, does that mean there are less than millions of invalid bytes? If so, detecting it as invalid would be a problem anyway, no?
> We could _not_ display the content and show the little info bar, but that's
> what we already do (except we show the helper app dialog). What we do need to
> do is add a "view as text" option on said helper app dialog.
I thought this was for the "we really think it is text/plain, although now that you mention it..." case, not the "they said it was text/plain, but we didn't believe them for a minute" case. (We do need that feature too, but that's another bug.)
Comment 8•20 years ago
(In reply to comment #7)
> We shouldn't crash if we get bogus font data. That should be fixed.
A lot of the crashing is upstream (not in our control).
> > bugs in xft, issues with font servers,
>
> Same as above, of course. Frankly I'd rather take my chances with xft bugs than
> Windows Media Player bugs.
The latter don't crash the app.
> Why are we looking up codepoints for characters that aren't being painted?
If we're loading the document as text/plain, we do try to paint it and all.
> If there are really millions of bogus codepoints, does that mean there are
> less than millions of invalid bytes?
No, there would be millions of _total_ bogus bytes in the file. We'd only look at the first 1024 bytes, though; if there's nothing bogus in there, chances are the whole file is ok.
> I thought this was for the "we really think it is text/plain, although now
> that you mention it..." case, not the "they said it was text/plain, but we
> didn't believe them for a minute" case.
This is for the "they said it was text/plain, so chances are they're lying, because Apache just lies by default; let's do a quick test to see whether it could conceivably be text/plain" case.
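A rough sketch of the "first 1024 bytes" test mentioned above, in Python and under stated assumptions. The thread does not say how a multi-byte sequence cut off at the window boundary would be handled; this version uses an incremental decoder so that such a cut is not counted as an error. `SNIFF_WINDOW` and `prefix_is_valid_utf8` are hypothetical names.

```python
import codecs

SNIFF_WINDOW = 1024  # window size taken from the comment above


def prefix_is_valid_utf8(data: bytes, window: int = SNIFF_WINDOW) -> bool:
    """Check only the first `window` bytes of data for UTF-8 validity."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    # If the data is longer than the window, the prefix may end in the
    # middle of a character, so an incomplete trailing sequence is allowed.
    # If the whole input fits in the window, it must be complete.
    is_whole_input = len(data) <= window
    try:
        decoder.decode(data[:window], final=is_whole_input)
    except UnicodeDecodeError:
        return False
    return True
```

With this, text whose 1024th byte is the first half of a two-byte character still passes, while a prefix starting with bytes such as `MZ\x90\x00` fails immediately because 0x90 cannot start a UTF-8 sequence.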
Comment 9•20 years ago
In that case I don't understand bug 261207 comment 4, from which this bug was apparently derived.
BTW, for text files, we really shouldn't be painting characters if they are off-screen. Should I file a bug on that? Seems like that would be an easy win and a definite perf advantage in cases like this. (Or did I misunderstand you?)
(And the crashes in upstream code obviously should be fixed too (whether by us or others); if they aren't, then Mozilla won't be the only crashing app; it'll make the entire system unstable.)
Comment 10•20 years ago
(In reply to comment #9)
> In that case I don't understand bug 261207 comment 4, from which this bug was
> apparently derived.
This bug is a suggestion to change our current "we don't think it's text/plain, so we'll check" criteria (which are hinted at in bug 261207 comment 4).
> BTW, for text files, we really shouldn't be painting characters if they are
> off-screen
We don't as far as I know. But we have to get glyph info for them anyway, since it affects layout.
> (And the crashes in upstream code obviously should be fixed too
It's being worked on, but the upstream code's buggy versions are widely installed in numerous Linux distributions...
Comment 11•20 years ago
(In reply to comment #5)
> Are there really enough pages sent as UTF-8 that aren't text files to warrant
> this? (Does IE in XPSP2 do _any_ sniffing of text/plain content?)
Yes, nearly all the files I've seen that are sent with the wrong MIME type that Mozilla doesn't currently sniff correctly are sent as text/plain UTF-8.
Yes, as far as I can tell, by default the latest IE does all the sniffing that its predecessors do. There's an option to turn off sniffing, but unsurprisingly not many people choose to do that.
Comment 12•20 years ago
It looks like Apache on Redhat/Fedora is set to send text/plain pages as UTF-8 by default now: https://www.redhat.com/archives/fedora-list/2005-March/msg03022.html
Comment 13•20 years ago
We're running into this problem with Mozilla/Firefox themes now, which are distributed as .jar files. Most of our FTP mirrors are serving them with a text/plain filetype. We didn't notice before because most of them were also sending charset=ISO-8859-1, which this content-sniffing apparently kicks in on. A couple of them recently started serving them as UTF-8, and we started getting complaints. The correct answer is to badger the servers into fixing the mime types. I sent out an email today to all of our FTP mirrors asking them to set the mime type for jar files.
Updated•20 years ago
Summary: validate charset of text/plain downloads (first N bytes) before accepting as text/plain → validate charset (esp. UTF-8) of text/plain downloads (first N bytes) before accepting as text/plain
Comment 14•19 years ago
(In reply to comment #0)
> ...a binary file is incorrectly served as "text/plain; charset=UTF-8"
It looks like this Content-Type is becoming an increasingly common problem. According to http://www.hardforum.com/showpost.php?p=1027768766&postcount=9 it looks like the exact header line causing the problem is:
Content-Type: text/plain; charset=UTF-8
I've found a few other examples with this exact Content-Type header line, and all are running Apache 2 on Fedora.
See also http://forums.mozillazine.org/viewtopic.php?t=248906
Comment 15•19 years ago
Has anyone filed a bug on Fedora about this? This is totally a bug in either Apache (long filed on them) or the settings Fedora sets up for Apache by default.
Comment 17•18 years ago
(In reply to comment #15)
> This is totally a bug in either Apache (long filed on them) or the
> settings Fedora sets up for Apache by default.
Apache bug: http://issues.apache.org/bugzilla/show_bug.cgi?id=13986
Fedora bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=197840
Updated•15 years ago
Assignee: file-handling → nobody
QA Contact: ian → file-handling
Comment 18•15 years ago
This has become less of a problem as Firefox has become more popular and Fedora installs Apache to serve WMV files as the correct MIME type. Apache is even removing the default MIME type in the next version of httpd. I think this should be a WONTFIX at this point, as the two bugs I linked to in comment #17 are fixed.
Updated•8 years ago
Product: Core → Firefox
Version: Trunk → unspecified
Updated•2 years ago
Severity: normal → S3