Closed
Bug 324985
Opened 19 years ago
Closed 19 years ago
First-Chance at Content Sniffing
Categories: Core :: Networking (enhancement)
RESOLVED FIXED
People
(Reporter: bugs, Assigned: Biesinger)
References
Details
(Keywords: fixed1.8.1)
Attachments
(2 files, 2 obsolete files)
patch (deleted) | Details | Diff | Splinter Review
patch (deleted) | darin.moz: approval-branch-1.8.1+ | Details | Diff | Splinter Review
So, the current approach I'm using to implement RSS pretty printing is working much better than those attempted in times past but it has a shortcoming:
- feeds are commonly sent as less desirable content types, including text/xml and especially text/html.
Inserting a pre-parser for these content types is incredibly undesirable for performance reasons as it basically imposes a heavy burden on every page load.
I talked with the Safari team to find out what they did - they seem to handle text/html and application/octet-stream feeds just fine. Hyatt told me they used content sniffing in their networking library. He said we probably had something similar in Gecko for HTML. Darin told me we had sniffing that works only when the content type is unknown, and suggested we could make it sniff every time, and allow lightweight, optimized C++ components to register to receive content type, file extension and a small chunk of data at the start of the transfer and give them a chance to coerce content type to something more suitable.
Doing this would fix many issues with my current implementation, such as the fact that it fails miserably for text/html and imposes a performance bottleneck on text/xml documents, which is probably not acceptable for shipping code. All feed documents would have their types coerced to application/rss+xml, application/atom+xml, or something suitably unique and not internally handled, so that a custom content handler could be used instead of the default Gecko handling.
Comment 1 • 19 years ago
I don't see how we can sniff every time in a reasonable way as things stand without first fixing the bug about content-encoding and stream converters not playing nicely with each other (and sadly, I can't seem to locate that bug).
Comment 2 • 19 years ago
Boris: If you find that bug please let me know. I would be interested in reading about the issue. Or, maybe you can summarize here or via email.
Assignee
Comment 3 • 19 years ago
oh, you mean the sniffer would have problems when it gets the compressed content? I think we can ignore that for an initial implementation and just advise implementers to be careful when they see a content encoding on the channel.
taking bug, I intend to only implement this for nsBaseChannel on trunk, though.
Assignee: darin → cbiesinger
Comment 4 • 19 years ago
The problem is that once you get an OnDataAvailable there is no way to turn off decompression. So if you get a Content-Encoded channel you either have to not sniff at all or you can't sniff it to something that might need saving (since saving needs to preserve the compression as often as not).
Comment 5 • 19 years ago
See also bug 233047.
Maybe we don't have a bug filed on this issue.... I could have sworn we did. :(
Comment 6 • 19 years ago
If it's sent as text/html, it's not RSS. I don't care how many feeds are mislabelled. We should not and must not treat them as anything but HTML.
Comment 7 • 19 years ago
Hixie: How else do you suggest dealing with the numerous mis-labeled feeds out on the internet that other browsers (IE7 and Safari) happily display as the user would expect? Does FF have the inertia to turn such a tide?
Boris, Biesi: I think the main problem is nsIEncodedChannel::applyConversions. Because of that API, we need to defer initiating decompression until after the consumer receives OnStartRequest. That is indeed in conflict with the idea of sniffing that first chunk of data prior to the consumer receiving OnStartRequest. If it were not for that API, then we could happily initiate decompression and perform sniffing prior to OnStartRequest. One solution is to decompress the initial chunk of data, perform sniffing, and keep that initial chunk of decompressed data and decompression state around until after OnStartRequest. That requires tighter coupling between the channel implementation and the decompressor (or any other "content-encoding" decoder).
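The decompress-the-first-chunk idea can be sketched roughly like this. All the types below are illustrative modern-C++ stand-ins, not the real necko classes: the point is only that the channel decodes a *copy* of the first chunk for the sniffers while keeping the raw encoded bytes around, so that after OnStartRequest the consumer can still choose (via applyConversion) whether it wants the decoded or the original encoded stream.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Stand-in for whatever "content-encoding" decoder the channel holds
// (e.g. a gzip inflater); here it is just a function on byte buffers.
using Decoder = std::function<std::vector<uint8_t>(const std::vector<uint8_t>&)>;

struct SniffState {
    std::vector<uint8_t> raw;      // first chunk exactly as received (still encoded)
    std::vector<uint8_t> decoded;  // decoded copy, fed only to the sniffers
};

// Decode a copy of the first chunk for sniffing while retaining the raw
// bytes for later delivery to the real listener. The decoder state would
// also have to be kept alive until after OnStartRequest, which is the
// "tighter coupling" the comment above mentions.
SniffState PrepareForSniffing(const std::vector<uint8_t>& firstChunk,
                              const Decoder& decode) {
    SniffState s;
    s.raw = firstChunk;              // preserved for delivery
    s.decoded = decode(firstChunk);  // consumed only by content sniffers
    return s;
}
```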
Comment 8 • 19 years ago
Unfortunately, it seems every single RSS reader does this crap. :(
Between that, attempts to not bother with XML well-formedness in feeds, and attempts to ignore the charset specified for feeds, it seems that we're basically pretending that feeds are not served over HTTP and have nothing to do with XML...
Darin, I know I've written up proposals on handling this compression issue before; either in Wiki form, in email, in newsgroups, or in a bug somewhere. I'll try to dig them up, I guess, but I don't have a lot of the mail I've sent over the last few years around, so that might be interesting.
Assignee
Comment 9 • 19 years ago
note that I agree with comment 6. the reason I don't mind this change is, basically, application/octet-stream.
I want to implement this in a way that it happens before OnStartRequest.
Assignee
Comment 10 • 19 years ago
maybe I should talk a bit about how I want to implement this...
- modify nsIContentSniffer to take an additional nsIRequest argument
- create a helper class nsCategoryCache<T> which caches instantiated services for the entries of a category (for speed)
- in nsBaseChannel::OnStartRequest, ReadSegments from the input stream (probably have to store the earlier return value of OpenContentStream as a member variable)
- if the content type is unknown, use the unknown decoder as a first content sniffer (I'll make it implement nsIContentSniffer too)
- pass those read bytes to all the content sniffers
comments? :)
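The core of the plan above — pass the first bytes to each registered sniffer and coerce the type on a hit — can be sketched like this. These are stand-in C++ types for illustration, not the real nsIContentSniffer/nsCOMArray API; the toy FeedSniffer and its "<rss" check are made up.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for nsIContentSniffer: given the channel's current
// type and the first bytes of the stream, a sniffer may return a better
// type; an empty string means "no opinion".
struct Sniffer {
    virtual ~Sniffer() = default;
    virtual std::string Sniff(const std::string& currentType,
                              const uint8_t* data, size_t count) = 0;
};

// The loop from the plan: each sniffer sees the data, and the content type
// is updated after every hit so the next sniffer sees the coerced value
// (as comment 11 below suggests).
std::string RunSniffers(std::string contentType,
                        const std::vector<Sniffer*>& sniffers,
                        const uint8_t* data, size_t count) {
    for (Sniffer* s : sniffers) {
        std::string newType = s->Sniff(contentType, data, count);
        if (!newType.empty())
            contentType = newType;  // next sniffer sees the updated type
    }
    return contentType;
}

// Toy feed sniffer: coerce anything whose body starts with "<rss" to a
// feed type. A real sniffer would be far more careful than this.
struct FeedSniffer : Sniffer {
    std::string Sniff(const std::string&, const uint8_t* data,
                      size_t count) override {
        if (count >= 4 && memcmp(data, "<rss", 4) == 0)
            return "application/rss+xml";
        return "";
    }
};
```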
Assignee
Comment 11 • 19 years ago
(probably set mContentType after a sniffer returns, so that the next one has the right value)
Comment 12 • 19 years ago
So then if the data is compressed consumers will see compressed data, right?
Comment 13 • 19 years ago
Biesi: So you would be replacing the current call to PushStreamConverter in OnStartRequest with an explicit call into nsIContentSniffer for the unknown content type decoder? Makes sense to me.
Boris: Yeah, I think that's what he's proposing. That seems like a reasonable start to this. I think we should expose the decompression code in a more friendly and direct way so that it may be used either by the sniffers or better yet by the channel prior to invoking the sniffers.
(This is going to force me to port HTTP to nsBaseChannel sooner than later!)
Assignee
Comment 14 • 19 years ago
right.
hm... darin/bz: (re comment 7) couldn't we decompress just that initial junk of data, and pass that to content sniffers, but without consuming the data (so that the real listener still gets it)?
>Biesi: So you would be replacing the current call to PushStreamConverter in
>OnStartRequest with an explicit call into nsIContentSniffer for the unknown
>content type decoder? Makes sense to me.
Yep. This does have advantages: It means that applyConversion etc can be set by the real listener in onStartRequest.
> (This is going to force me to port HTTP to nsBaseChannel sooner than later!)
:-)
Severity: normal → enhancement
Comment 15 • 19 years ago
> couldn't we decompress just that initial junk of data, and pass that to content
> sniffers
For this specific case, we could... I still think we should solve the general arch problem, though.
Ben, is this something you're looking for for FF2, or FF3? The base channel stuff may not be quite FF2 material...
Assignee
Comment 16 • 19 years ago
hmm, s/junk/chunk/, that was an interesting typo
darin suggested making this HTTP-specific for the 1.8 branch
Comment 17 • 19 years ago
This is definitely worth solving for FF2 as Ben has some really awesome UI ideas for RSS feed handling that would really be worth getting into FF2 ;-)
It would probably help if we had a more direct API to the decompression code. nsIStreamConverter is just so not convenient to work with.
Reporter
Comment 18 • 19 years ago
What Darin said.
Comment 19 • 19 years ago
So I just realized that doing this in the channel has a major problem -- most consumers don't want to end up with RSS instead of XML (think XMLHttpRequest, etc). The only consumer where we want to do this sniffing is docshell, really (one _could_ try to make an argument for URILoader, but as things stand URILoader really only supports loads into docshells well).
Assignee
Comment 20 • 19 years ago
ok, I'll kinda counter what I said on IRC... doing this in uriloader would actually be pretty hard, because I can't just read from an input stream in OnStartRequest -- I don't have one there...
Comment 21 • 19 years ago
> So I just realized that doing this in the channel has a major problem -- most
> consumers don't want to end up with RSS instead of XML (think XMLHttpRequest,
> etc).
It sounds like we could simply add a way to flag requests as wanting content sniffing.
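The opt-in flag suggested here could look something like the following sketch. The flag did end up being called LOAD_CALL_CONTENT_SNIFFERS (see the nsNetCID.h hunk in the patch below), but the bit value used here is made up for illustration.

```cpp
#include <cstdint>

// Illustrative stand-in for nsIChannel::LOAD_CALL_CONTENT_SNIFFERS;
// the actual bit position in nsIChannel is not this one.
constexpr uint32_t LOAD_CALL_CONTENT_SNIFFERS = 1u << 21;

// Only consumers that explicitly set the flag (e.g. docshell loads) get
// their content type sniffed; XMLHttpRequest would leave it unset and
// keep the server-advertised type. Channels are also free to ignore the
// flag entirely, as the patch's comments note.
bool ShouldSniff(uint32_t loadFlags) {
    return (loadFlags & LOAD_CALL_CONTENT_SNIFFERS) != 0;
}
```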
Comment 22 • 19 years ago
Yeah, if we do that then things are ok.
One other thought: the API we end up writing should make it very clear that there are security implications to doing any content sniffing and that implementors of our sniffing API should make sure they're aware of them.
Assignee
Comment 23 • 19 years ago
(In reply to comment #22)
> One other thought: the API we end up writing should make it very clear that
> there are security implications to doing any content sniffing
like what?
Comment 24 • 19 years ago
Like the fact that if a firewall only allows through types A, B, and C and we change type A to a dangerous type D behind its back (here D == application/msword or text/html or any of a bunch of other things) then we become an attack vector for someone trying to get at the machines behind the firewall.
Note that this is already a problem with our text/plain sniffing and a large part of the reason we do NOT render sniffed text/plain documents in the browser automatically.
The fact that I've had to explain this multiple times is what makes me want to document it on the interface. Otherwise I fully expect exploitable extensions to pop up all over.
Comment 25 • 19 years ago
Is that also why a file sent as application/octet-stream but sniffed as an MP3 ignores the "always do this" checkbox in the open/save dialog?
Assignee
Comment 26 • 19 years ago
er...
1) that doesn't sniff the content; it just looks at the extension
2) you can't remember the choice, because that works on MIME types, and two octet-stream files may well be quite different file types
Comment 27 • 19 years ago
> Like the fact that if a firewall only allows through types A, B, and C and we
> change type A to a dangerous type D behind its back (here D ==
> application/msword or text/html or any of a bunch of other things) then we
> become an attack vector for someone trying to get at the machines behind the
> firewall.
Isn't this an issue with IE browsers as well? If so, then what good would such a firewall be if it didn't scan the data as well as the advertised MIME type? I would be surprised if sniffing isn't common in firewalls that attempt to block dangerous content types.
Comment 28 • 19 years ago
> Isn't this an issue with IE browsers as well?
Yes, and they've gotten a fair amount of bad press for it, and they're changing the behavior in IE7 as I understand it.
> If so, then what good would such a firewall be if it didn't scan the data as
> well as the advertized mime type?
Not much either way, actually. Once you have browsers sniffing, the firewall is damned if it does and damned if it doesn't. That is, the "best-case" scenario in a situation where a browser sniffs is that the firewall has to be very aggressive about blocking and users can't see stuff they really should be able to see (eg plaintext that happens to have some angle brackets in it). If the browser behaves just a tad, the situation for users is much improved.
In any case, my point is that introducing precisely the security issues from IE that people have been complaining about for years, and that MS is just now fixing, is really not something to do lightly. I don't think that the proposed RSS sniffer introduces a problem. I'm just worried about extensions. See http://dbaron.org/log/2006-01#e20060122a for more information...
Assignee
Comment 29 • 19 years ago
a patch here could use a helper class I'll add in bug 315598
Assignee
Comment 30 • 19 years ago
ok, here's the patch. requires the patches of both dependencies.
I'm not 100% sure whether my HTTP changes are correct here.
Attachment #211972 -
Flags: review?(darin)
Comment 31 • 19 years ago
Comment on attachment 211972 [details] [diff] [review]
patch
>diff --git a/netwerk/base/src/nsBaseChannel.cpp b/netwerk/base/src/nsBaseChannel.cpp
>+ const nsCOMArray<nsIContentSniffer> sniffers =
>+ gIOService->GetContentSniffers();
I think you meant to do this:
const nsCOMArray<nsIContentSniffer> &sniffers = ...
Otherwise, I believe you are making an unnecessary copy of the array.
>+ if (NS_SUCCEEDED(rv) && !newType.IsEmpty()) {
>+ chan->SetContentType(newType);
>+ }
You should throw in a "break" statement here, right? Or, do you mean
to let the content sniffers fight over what the real content type is?
If there is going to be fighting like that, then should we give them
a way to prioritize themselves? (See for example RegisterFilter on
nsIProtocolProxyService.)
>diff --git a/netwerk/build/nsNetCID.h b/netwerk/build/nsNetCID.h
>+ * Note that only channels with the LOAD_CALL_CONTENT_SNIFFERS flag will call
>+ * content sniffers.
...
>+#define NS_CONTENT_SNIFFER_CATEGORY "net-content-sniffers"
You should also say (somewhere) that channel implementations are not required
to honor the LOAD_CALL_CONTENT_SNIFFERS flag. They can simply ignore it for
whatever reason.
>diff --git a/netwerk/protocol/http/src/nsHttpChannel.cpp b/netwerk/protocol/http/src/nsHttpChannel.cpp
>+static nsresult
>+NS_NewInputStreamPump(nsInputStreamPump **result,
>+ nsIInputStream *stream,
>+ PRInt64 streamPos = nsInt64(-1),
>+ PRInt64 streamLen = nsInt64(-1),
>+ PRUint32 segsize = 0,
>+ PRUint32 segcount = 0,
>+ PRBool closeWhenDone = PR_FALSE)
>+{
>+ nsresult rv = NS_ERROR_OUT_OF_MEMORY;
>+ nsRefPtr<nsInputStreamPump> pump = new nsInputStreamPump();
>+ if (pump) {
>+ rv = pump->Init(stream, streamPos, streamLen,
>+ segsize, segcount, closeWhenDone);
>+ if (NS_SUCCEEDED(rv)) {
>+ *result = nsnull;
>+ pump.swap(*result);
>+ }
>+ }
>+ return rv;
>+}
Maybe this method should live on nsInputStreamPump and be called
by both this code as well as nsBaseChannel. Maybe call it:
nsInputStreamPump::Create(... params ...)
>+static void
>+CallTypeSniffers(void *aClosure, const PRUint8 *aData, PRUint32 aCount)
>+{
>+ nsIChannel *chan = NS_STATIC_CAST(nsIChannel*, aClosure);
>+
>+ const nsCOMArray<nsIContentSniffer> sniffers =
>+ gIOService->GetContentSniffers();
>+ PRUint32 length = sniffers.Count();
>+ for (PRUint32 i = 0; i < length; ++i) {
>+ nsCAutoString newType;
>+ nsresult rv =
>+ sniffers[i]->GetMIMETypeFromContent(chan, aData, aCount, newType);
>+ if (NS_SUCCEEDED(rv) && !newType.IsEmpty()) {
>+ chan->SetContentType(newType);
>+ }
>+ }
>+}
We could also share this implementation. However, perhaps we should
just wait and make nsHttpChannel inherit from nsBaseChannel. Yeah,
that's okay. Please add a comment saying that if you want to go with
duplicated code for now.
r=darin
Attachment #211972 -
Flags: review?(darin) → review+
Assignee
Comment 32 • 19 years ago
not sure that my wording for the new flag is so great, suggestions accepted :)
Attachment #211972 -
Attachment is obsolete: true
Attachment #213817 -
Flags: superreview?(bzbarsky)
Assignee
Comment 33 • 19 years ago
meant to comment on that...
> Or, do you mean
> to let the content sniffers fight over what the real content type is?
That was my intention, but you're right, a priority would be good if I do that, so I added the break for now; I'll file a bug on adding prioritization.
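For reference, a prioritized variant along the lines of RegisterFilter on nsIProtocolProxyService could look like the sketch below. These names and types are illustrative only; the follow-up bug on prioritization may have chosen a different shape.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// A sniffer callback returns a MIME type, or "" for "no opinion".
using SniffFn = std::function<std::string(const uint8_t*, size_t)>;

struct PrioritizedSniffer {
    int priority;   // lower value runs first
    SniffFn sniff;
};

// Run sniffers in priority order and stop at the first hit, matching the
// "break" that was added to the patch: the highest-priority sniffer with
// an opinion wins instead of later sniffers fighting over the type.
std::string SniffWithPriority(std::vector<PrioritizedSniffer> sniffers,
                              const uint8_t* data, size_t count) {
    std::sort(sniffers.begin(), sniffers.end(),
              [](const PrioritizedSniffer& a, const PrioritizedSniffer& b) {
                  return a.priority < b.priority;
              });
    for (const auto& s : sniffers) {
        std::string type = s.sniff(data, count);
        if (!type.empty())
            return type;
    }
    return "";  // no sniffer claimed the content
}
```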
Comment 34 • 19 years ago
Comment on attachment 213817 [details] [diff] [review]
patch v2
>diff --git a/netwerk/build/nsNetCID.h b/netwerk/build/nsNetCID.h
>+ * Not all channels may implement this load flag. See also
>+ * nsIChannel::LOAD_CALL_CONTENT_SNIFFERS.
We're not describing a load flag, so the first sentence here should be changed somehow.
sr=bzbarsky
Attachment #213817 -
Flags: superreview?(bzbarsky) → superreview+
Assignee
Comment 35 • 19 years ago
- * Not all channels may implement this load flag. See also
+ * Not all channels may implement content sniffing. See also
Attachment #213817 -
Attachment is obsolete: true
Assignee
Comment 36 • 19 years ago
fixed on trunk, leaving open for branch
Assignee
Comment 37 • 19 years ago
HTTP-only
Attachment #213946 -
Flags: approval-branch-1.8.1?(darin)
Assignee
Comment 38 • 19 years ago
Comment on attachment 213946 [details] [diff] [review]
patch for 1.8 branch
er, I'll remove this comment before checking in:
+// NOTE: This function duplicates code from nsBaseChannel. This will go away
+// once HTTP uses nsBaseChannel (part of bug 312760)
Comment 39 • 19 years ago
Comment on attachment 213946 [details] [diff] [review]
patch for 1.8 branch
>Index: netwerk/base/src/nsInputStreamPump.cpp
...
>+nsresult
>+nsInputStreamPump::Create(nsInputStreamPump **result,
...
>+ return rv;
>+}
>+
>+
>+
> struct PeekData {
nit: kill some of that whitespace?
Attachment #213946 -
Flags: approval-branch-1.8.1?(darin) → approval-branch-1.8.1+
Assignee
Comment 40 • 19 years ago
fixed on the branch now too.
Assignee
Comment 41 • 19 years ago
filed bug 329644 on calling all content sniffers
No longer blocks: 329644