Closed Bug 324985 Opened 19 years ago Closed 19 years ago

First-Chance at Content Sniffing

Categories

(Core :: Networking, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: bugs, Assigned: Biesinger)

References

Details

(Keywords: fixed1.8.1)

Attachments

(2 files, 2 obsolete files)

So, the current approach I'm using to implement RSS pretty printing is working much better than those attempted in times past but it has a shortcoming: - feeds are commonly sent as less desirable content types, including text/xml and especially text/html. Inserting a pre-parser for these content types is incredibly undesirable for performance reasons as it basically imposes a heavy burden on every page load. I talked with the Safari team to find out what they did - they seem to handle text/html and application/octet-stream feeds just fine. Hyatt told me they used content sniffing in their networking library. He said we probably had something similar in Gecko for HTML. Darin told me we had sniffing that works only when the content type is unknown, and suggested we could make it sniff every time, and allow lightweight, optimized C++ components to register to receive content type, file extension and a small chunk of data at the start of the transfer and give them a chance to coerce content type to something more suitable. Doing this would fix many issues with my current implementation, such as the fact that it fails miserably for text/html and imposes the performance bottleneck on text/xml documents which is probably not acceptable for shipping code because all feed documents would have their types coerced to application/rss+xml or application/atom+xml or something suitably unique and not internally handled like that, so that a custom content handler could be used instead of default gecko handling.
I don't see how we can sniff every time in a reasonable way as things stand without fixing the bug on content-encoding and stream converters not playing nice with each other first (and sadly, I can't seem to locate the bug).
Boris: If you find that bug please let me know. I would be interested in reading about the issue. Or, maybe you can summarize here or via email.
oh, you mean the sniffer would have problems when it gets the compressed content? I think we can ignore that for an initial implementation and just advice implementors to be careful when they see a content encoding on the channel. taking bug, I intend to only implement this for nsBaseChannel on trunk, though.
Assignee: darin → cbiesinger
The problem is that once you get an OnDataAvailable there is no way to turn off decompression. So if you get a Content-Encoded channel you either have to not sniff at all or you can't sniff it to something that might need saving (since saving needs to preserve the compression as often as not).
See also bug 233047. Maybe we don't have a bug filed on this issue.... I could have sworn we did. :(
If it's sent as text/html, it's not RSS. I don't care how many feeds are mislabelled. We should not and must not treat them as anything but HTML.
Hixie: How else do you suggest dealing with the numerous mis-labeled feeds out on the internet that other browsers (IE7 and Safari) happily display as the user would expect? Does FF have the inertia to turn such a tide? Boris, Biesi: I think the main problem is nsIEncodedChannel::applyConversions. Because of that API, we need to defer initiating decompression until after the consumer receives OnStartRequest. That is indeed in conflict with the idea of sniffing that first chunk of data prior to the consumer receiving OnStartRequest. If it were not for that API, then we could happily initiate decompression and perform sniffing prior to OnStartRequest. One solution is to decompress the initial chunk of data, perform sniffing, and keep that initial chunk of decompressed data and decompression state around until after OnStartRequest. That requires tighter coupling between the channel implementation and the decompressor (or any other "content-encoding" decoder).
Unfortunately, it seems every single RSS reader does this crap. :( Between that, attempts to not bother with XML well-formedness in feeds, and attempts to ignore the charset specified for feeds, it seems that we're basically pretending that feeds are not served over HTTP and have nothing to do with XML... Darin, I know I've written up proposals on handling this compression issue before; either in Wiki form, in email, in newsgroups, or in a bug somewhere. I'll try to dig them up, I guess, but I don't have a lot of the mail I've sent over the last few years around, so that might be interesting.
note that I agree with comment 6. the reason I don't mind this change is, basically, application/octet-stream. I want to implement this in a way that it happens before OnStartRequest.
maybe I should talk a bit how I want to implement this... - modify nsIContentSniffer to take an additional nsIRequest argument - create a helper class nsCategoryCache<T> which caches instantiated services for the entries of a category (for speed) - in nsBaseChannel::OnStartRequest, ReadSegments from the input stream (probably have to store the earlier return value of OpenContentStream as a member variable) - if the content type is unknown, use the unknown decoder as a first content sniffer (I'll make it implement nsIContentSniffer too) - pass those read bytes to all the content sniffers comments? :)
(probably set mContentType after a sniffer returns, so that the next one has the right value)
So then if the data is compressed consumers will see compressed data, right?
Biesi: So you would be replacing the current call to PushStreamConverter in OnStartRequest with an explicit call into nsIContentSniffer for the unknown content type decoder? Makes sense to me. Boris: Yeah, I think that's what he's proposing. That seems like a reasonable start to this. I think we should expose the decompression code in a more friendly and direct way so that it may be used either by the sniffers or better yet by the channel prior to invoking the sniffers. (This is going to force me to port HTTP to nsBaseChannel sooner than later!)
right. hm... darin/bz: (re comment 7) couldn't we decompress just that initial junk of data, and pass that to content sniffers, but without consuming the data (so that the real listener still gets it)? >Biesi: So you would be replacing the current call to PushStreamConverter in >OnStartRequest with an explicit call into nsIContentSniffer for the unknown >content type decoder? Makes sense to me. Yep. This does have advantages: It means that applyConversion etc can be set by the real listener in onStartRequest. > (This is going to force me to port HTTP to nsBaseChannel sooner than later!) :-)
Severity: normal → enhancement
> couldn't we decompress just that initial junk of data, and pass that to content > sniffers For this specific case, we could... I still think we should solve the general arch problem, though. Ben, is this something you're looking for for FF2, or FF3? The base channel stuff may not be quite FF2 material...
hmm, s/junk/chunk/, that was an interesting typo darin suggested making this HTTP-specific for the 1.8 branch
This is definitely worth solving for FF2 as Ben has some really awesome UI ideas for RSS feed handling that would really be worth getting into FF2 ;-) It would probably help if we had a more direct API to the decompression code. nsIStreamConverter is just so not convenient to work with.
What Darin said.
So I just realized that doing this in the channel has a major problem -- most consumers don't want to end up with RSS instead of XML (think XMLHttpRequest, etc). The only consumer where we want to do this sniffing is docshell, really (one _could_ try to make an argument for URILoader, but really URILoader only supports well loads into docshells as things stand).
ok, I'll kinda counter what I said on IRC... doing this in uriloader would be actually pretty hard, because I can't just read from an input stream in OnStartRequest, because I have none...
> So I just realized that doing this in the channel has a major problem -- most > consumers don't want to end up with RSS instead of XML (think XMLHttpRequest, > etc). It sounds like we could simply add a way to flag requests as wanting content sniffing.
Yeah, if we do that then things are ok. One other thought: the API we end up writing should make it very clear that there are security implications to doing any content sniffing and that implementors of our sniffing API should make sure they're aware of them.
(In reply to comment #22) > One other thought: the API we end up writing should make it very clear that > there are security implications to doing any content sniffing like what?
Like the fact that if a firewall only allows through types A, B, and C and we change type A to a dangerous type D behind its back (here D == application/msword or text/html or any of a bunch of other things) then we become an attack vector for someone trying to get at the machines behind the firewall. Note that this is already a problem with our text/plain sniffing and a large part of the reason we do NOT render sniffed text/plain documents in the browser automatically. The fact that I've had to explain this multiple times is what makes me want to document it on the interface. Otherwise I fully expect exploitable extensions to pop up all over.
Is that also why a file sent as application/otect-stream but sniffed as an MP3 ignores the "always do this" checkbox in the open/save dialog?
er... 1) that doesn't sniff the content it just looks at the extension 2) you can't remember the choice because that works on mime types, and two octet-stream files may well be a quite different file type
> Like the fact that if a firewall only allows through types A, B, and C and we > change type A to a dangerous type D behind its back (here D == > application/msword or text/html or any of a bunch of other things) then we > become an attack vector for someone trying to get at the machines behind the > firewall. Isn't this an issue with IE browsers as well? If so, then what good would such a firewall be if it didn't scan the data as well as the advertized mime type? I would be surprised if sniffing isn't common in firewalls that attempt to block dangerous content types.
> Isn't this an issue with IE browsers as well? Yes, and they've gotten a fair amount of bad press for it and they're changing the behavior in IE7 as I understand. > If so, then what good would such a firewall be if it didn't scan the data as > well as the advertized mime type? Not much either way, actually. Once you have browsers sniffing, the firewall is damned if it does and damned if it doesn't. That is, the "best-case" scenario in a situation where a browser sniffs is that the firewall has to be very aggressive about blocking and users can't see stuff they really should be able to see (eg plaintext that happens to have some angle brackets in it). If the browser behaves just a tad, the situation for users is much improved. In any case, my point is that introducing precisely the security issues from IE that people have been complaining about for years and that MS is just fixing is really not something to do likely. I don't think that the proposed RSS sniffer introduces a problem. I'm just worried about extensions. See http://dbaron.org/log/2006-01#e20060122a for more information...
a patch here could use a helper class I'll add in bug 315598
Depends on: 315598
No longer depends on: 325177
Attached patch patch (obsolete) (deleted) — Splinter Review
ok, here's the patch. requires the patches of both dependencies. I'm not 100% sure whether my HTTP changes are correct here.
Attachment #211972 - Flags: review?(darin)
Comment on attachment 211972 [details] [diff] [review] patch >diff --git a/netwerk/base/src/nsBaseChannel.cpp b/netwerk/base/src/nsBaseChannel.cpp >+ const nsCOMArray<nsIContentSniffer> sniffers = >+ gIOService->GetContentSniffers(); I think you meant to do this: const nsCOMArray<nsIContentSniffer> &sniffers = ... Otherwise, I believe you are making an unnecessary copy of the array. >+ if (NS_SUCCEEDED(rv) && !newType.IsEmpty()) { >+ chan->SetContentType(newType); >+ } You should throw in a "break" statement here, right? Or, do you mean to let the content sniffers fight over what the real content type is? If there is going to be fighting like that, then should we give them a way to prioritize themselves? (See for example RegisterFilter on nsIProtocolProxyService.) >diff --git a/netwerk/build/nsNetCID.h b/netwerk/build/nsNetCID.h >+ * Note that only channels with the LOAD_CALL_CONTENT_SNIFFERS flag will call >+ * content sniffers. ... >+#define NS_CONTENT_SNIFFER_CATEGORY "net-content-sniffers" You should also say (somewhere) that channel implementations are not required to honor the LOAD_CALL_CONTENT_SNIFFERS flag. They can simply ignore it for whatever reason. >diff --git a/netwerk/protocol/http/src/nsHttpChannel.cpp b/netwerk/protocol/http/src/nsHttpChannel.cpp >+static nsresult >+NS_NewInputStreamPump(nsInputStreamPump **result, >+ nsIInputStream *stream, >+ PRInt64 streamPos = nsInt64(-1), >+ PRInt64 streamLen = nsInt64(-1), >+ PRUint32 segsize = 0, >+ PRUint32 segcount = 0, >+ PRBool closeWhenDone = PR_FALSE) >+{ >+ nsresult rv = NS_ERROR_OUT_OF_MEMORY; >+ nsRefPtr<nsInputStreamPump> pump = new nsInputStreamPump(); >+ if (pump) { >+ rv = pump->Init(stream, streamPos, streamLen, >+ segsize, segcount, closeWhenDone); >+ if (NS_SUCCEEDED(rv)) { >+ *result = nsnull; >+ pump.swap(*result); >+ } >+ } >+ return rv; >+} Maybe this method should live on nsInputStreamPump and be called by both this code as well as nsBaseChannel. Maybe call it: nsInputStreamPump::Create(... params ...) >+static void >+CallTypeSniffers(void *aClosure, const PRUint8 *aData, PRUint32 aCount) >+{ >+ nsIChannel *chan = NS_STATIC_CAST(nsIChannel*, aClosure); >+ >+ const nsCOMArray<nsIContentSniffer> sniffers = >+ gIOService->GetContentSniffers(); >+ PRUint32 length = sniffers.Count(); >+ for (PRUint32 i = 0; i < length; ++i) { >+ nsCAutoString newType; >+ nsresult rv = >+ sniffers[i]->GetMIMETypeFromContent(chan, aData, aCount, newType); >+ if (NS_SUCCEEDED(rv) && !newType.IsEmpty()) { >+ chan->SetContentType(newType); >+ } >+ } >+} We could also share this implementation. However, perhaps we should just wait and make nsHttpChannel inherit from nsBaseChannel. Yeah, that's okay. Please add a comment saying that if you want to go with duplicated code for now. r=darin
Attachment #211972 - Flags: review?(darin) → review+
Attached patch patch v2 (obsolete) (deleted) — Splinter Review
not sure that my wording for the new flag is so great, suggestions accepted :)
Attachment #211972 - Attachment is obsolete: true
Attachment #213817 - Flags: superreview?(bzbarsky)
meant to comment on that... > Or, do you mean > to let the content sniffers fight over what the real content type is? That was my intention, but you're right, a priority would be good if I do that, so I added the break for now; I'll file a bug on adding priorization.
Comment on attachment 213817 [details] [diff] [review] patch v2 >diff --git a/netwerk/build/nsNetCID.h b/netwerk/build/nsNetCID.h >+ * Not all channels may implement this load flag. See also >+ * nsIChannel::LOAD_CALL_CONTENT_SNIFFERS. We're not describing a load flag, so the first sentence here should be changed somehow. sr=bzbarsky
Attachment #213817 - Flags: superreview?(bzbarsky) → superreview+
Attached patch patch v2.1 (deleted) — Splinter Review
- * Not all channels may implement this load flag. See also + * Not all channels may implement content sniffing. See also
Attachment #213817 - Attachment is obsolete: true
fixed on trunk, leaving open for branch
Attached patch patch for 1.8 branch (deleted) — Splinter Review
HTTP-only
Attachment #213946 - Flags: approval-branch-1.8.1?(darin)
Comment on attachment 213946 [details] [diff] [review] patch for 1.8 branch er, I'll remove this comment before checking in: +// NOTE: This function duplicates code from nsBaseChannel. This will go away +// once HTTP uses nsBaseChannel (part of bug 312760)
Comment on attachment 213946 [details] [diff] [review] patch for 1.8 branch >Index: netwerk/base/src/nsInputStreamPump.cpp ... >+nsresult >+nsInputStreamPump::Create(nsInputStreamPump **result, ... >+ return rv; >+} >+ >+ >+ > struct PeekData { nit: kill some of that whitespace?
Attachment #213946 - Flags: approval-branch-1.8.1?(darin) → approval-branch-1.8.1+
fixed on the branch now too.
Status: NEW → RESOLVED
Closed: 19 years ago
Keywords: fixed1.8.1
Resolution: --- → FIXED
filed bug 329644 on calling all content sniffers
No longer blocks: 329644
Blocks: 329644
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: