324985 - First-Chance at Content Sniffing

Ben Goodger (use ben at mozilla dot org for email)

Reporter

Description

•

19 years ago

So, the current approach I'm using to implement RSS pretty printing is working much better than those attempted in times past but it has a shortcoming: - feeds are commonly sent as less desirable content types, including text/xml and especially text/html. Inserting a pre-parser for these content types is incredibly undesirable for performance reasons as it basically imposes a heavy burden on every page load. I talked with the Safari team to find out what they did - they seem to handle text/html and application/octet-stream feeds just fine. Hyatt told me they used content sniffing in their networking library. He said we probably had something similar in Gecko for HTML. Darin told me we had sniffing that works only when the content type is unknown, and suggested we could make it sniff every time, and allow lightweight, optimized C++ components to register to receive content type, file extension and a small chunk of data at the start of the transfer and give them a chance to coerce content type to something more suitable. Doing this would fix many issues with my current implementation, such as the fact that it fails miserably for text/html and imposes the performance bottleneck on text/xml documents which is probably not acceptable for shipping code because all feed documents would have their types coerced to application/rss+xml or application/atom+xml or something suitably unique and not internally handled like that, so that a custom content handler could be used instead of default gecko handling.

Boris Zbarsky [:bzbarsky]

Comment 1

•

19 years ago

I don't see how we can sniff every time in a reasonable way as things stand without fixing the bug on content-encoding and stream converters not playing nice with each other first (and sadly, I can't seem to locate the bug).

Darin Fisher

Comment 2

•

19 years ago

Boris: If you find that bug please let me know. I would be interested in reading about the issue. Or, maybe you can summarize here or via email.

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 3

•

19 years ago

oh, you mean the sniffer would have problems when it gets the compressed content? I think we can ignore that for an initial implementation and just advice implementors to be careful when they see a content encoding on the channel. taking bug, I intend to only implement this for nsBaseChannel on trunk, though.

Assignee: darin → cbiesinger

Boris Zbarsky [:bzbarsky]

Comment 4

•

19 years ago

The problem is that once you get an OnDataAvailable there is no way to turn off decompression. So if you get a Content-Encoded channel you either have to not sniff at all or you can't sniff it to something that might need saving (since saving needs to preserve the compression as often as not).

Boris Zbarsky [:bzbarsky]

Comment 5

•

19 years ago

See also bug 233047. Maybe we don't have a bug filed on this issue.... I could have sworn we did. :(

Hixie (not reading bugmail)

Comment 6

•

19 years ago

If it's sent as text/html, it's not RSS. I don't care how many feeds are mislabelled. We should not and must not treat them as anything but HTML.

Darin Fisher

Comment 7

•

19 years ago

Hixie: How else do you suggest dealing with the numerous mis-labeled feeds out on the internet that other browsers (IE7 and Safari) happily display as the user would expect? Does FF have the inertia to turn such a tide? Boris, Biesi: I think the main problem is nsIEncodedChannel::applyConversions. Because of that API, we need to defer initiating decompression until after the consumer receives OnStartRequest. That is indeed in conflict with the idea of sniffing that first chunk of data prior to the consumer receiving OnStartRequest. If it were not for that API, then we could happily initiate decompression and perform sniffing prior to OnStartRequest. One solution is to decompress the initial chunk of data, perform sniffing, and keep that initial chunk of decompressed data and decompression state around until after OnStartRequest. That requires tighter coupling between the channel implementation and the decompressor (or any other "content-encoding" decoder).

Boris Zbarsky [:bzbarsky]

Comment 8

•

19 years ago

Unfortunately, it seems every single RSS reader does this crap. :( Between that, attempts to not bother with XML well-formedness in feeds, and attempts to ignore the charset specified for feeds, it seems that we're basically pretending that feeds are not served over HTTP and have nothing to do with XML... Darin, I know I've written up proposals on handling this compression issue before; either in Wiki form, in email, in newsgroups, or in a bug somewhere. I'll try to dig them up, I guess, but I don't have a lot of the mail I've sent over the last few years around, so that might be interesting.

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 9

•

19 years ago

note that I agree with comment 6. the reason I don't mind this change is, basically, application/octet-stream. I want to implement this in a way that it happens before OnStartRequest.

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 10

•

19 years ago

maybe I should talk a bit how I want to implement this... - modify nsIContentSniffer to take an additional nsIRequest argument - create a helper class nsCategoryCache<T> which caches instantiated services for the entries of a category (for speed) - in nsBaseChannel::OnStartRequest, ReadSegments from the input stream (probably have to store the earlier return value of OpenContentStream as a member variable) - if the content type is unknown, use the unknown decoder as a first content sniffer (I'll make it implement nsIContentSniffer too) - pass those read bytes to all the content sniffers comments? :)

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 11

•

19 years ago

(probably set mContentType after a sniffer returns, so that the next one has the right value)

Boris Zbarsky [:bzbarsky]

Comment 12

•

19 years ago

So then if the data is compressed consumers will see compressed data, right?

Darin Fisher

Comment 13

•

19 years ago

Biesi: So you would be replacing the current call to PushStreamConverter in OnStartRequest with an explicit call into nsIContentSniffer for the unknown content type decoder? Makes sense to me. Boris: Yeah, I think that's what he's proposing. That seems like a reasonable start to this. I think we should expose the decompression code in a more friendly and direct way so that it may be used either by the sniffers or better yet by the channel prior to invoking the sniffers. (This is going to force me to port HTTP to nsBaseChannel sooner than later!)

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 14

•

19 years ago

right. hm... darin/bz: (re comment 7) couldn't we decompress just that initial junk of data, and pass that to content sniffers, but without consuming the data (so that the real listener still gets it)? >Biesi: So you would be replacing the current call to PushStreamConverter in >OnStartRequest with an explicit call into nsIContentSniffer for the unknown >content type decoder? Makes sense to me. Yep. This does have advantages: It means that applyConversion etc can be set by the real listener in onStartRequest. > (This is going to force me to port HTTP to nsBaseChannel sooner than later!) :-)

Severity: normal → enhancement

Boris Zbarsky [:bzbarsky]

Comment 15

•

19 years ago

> couldn't we decompress just that initial junk of data, and pass that to content > sniffers For this specific case, we could... I still think we should solve the general arch problem, though. Ben, is this something you're looking for for FF2, or FF3? The base channel stuff may not be quite FF2 material...

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 16

•

19 years ago

hmm, s/junk/chunk/, that was an interesting typo darin suggested making this HTTP-specific for the 1.8 branch

Darin Fisher

Comment 17

•

19 years ago

This is definitely worth solving for FF2 as Ben has some really awesome UI ideas for RSS feed handling that would really be worth getting into FF2 ;-) It would probably help if we had a more direct API to the decompression code. nsIStreamConverter is just so not convenient to work with.

Ben Goodger (use ben at mozilla dot org for email)

Reporter

Comment 18

•

19 years ago

What Darin said.

Ben Goodger (use ben at mozilla dot org for email)

Reporter

Updated

•

19 years ago

Blocks: 324588

Boris Zbarsky [:bzbarsky]

Comment 19

•

19 years ago

So I just realized that doing this in the channel has a major problem -- most consumers don't want to end up with RSS instead of XML (think XMLHttpRequest, etc). The only consumer where we want to do this sniffing is docshell, really (one _could_ try to make an argument for URILoader, but really URILoader only supports well loads into docshells as things stand).

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Updated

•

19 years ago

Depends on: 325177

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 20

•

19 years ago

ok, I'll kinda counter what I said on IRC... doing this in uriloader would be actually pretty hard, because I can't just read from an input stream in OnStartRequest, because I have none...

Darin Fisher

Comment 21

•

19 years ago

> So I just realized that doing this in the channel has a major problem -- most > consumers don't want to end up with RSS instead of XML (think XMLHttpRequest, > etc). It sounds like we could simply add a way to flag requests as wanting content sniffing.

Boris Zbarsky [:bzbarsky]

Comment 22

•

19 years ago

Yeah, if we do that then things are ok. One other thought: the API we end up writing should make it very clear that there are security implications to doing any content sniffing and that implementors of our sniffing API should make sure they're aware of them.

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 23

•

19 years ago

(In reply to comment #22) > One other thought: the API we end up writing should make it very clear that > there are security implications to doing any content sniffing like what?

Boris Zbarsky [:bzbarsky]

Comment 24

•

19 years ago

Like the fact that if a firewall only allows through types A, B, and C and we change type A to a dangerous type D behind its back (here D == application/msword or text/html or any of a bunch of other things) then we become an attack vector for someone trying to get at the machines behind the firewall. Note that this is already a problem with our text/plain sniffing and a large part of the reason we do NOT render sniffed text/plain documents in the browser automatically. The fact that I've had to explain this multiple times is what makes me want to document it on the interface. Otherwise I fully expect exploitable extensions to pop up all over.

Jesse Ruderman

Comment 25

•

19 years ago

Is that also why a file sent as application/otect-stream but sniffed as an MP3 ignores the "always do this" checkbox in the open/save dialog?

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 26

•

19 years ago

er... 1) that doesn't sniff the content it just looks at the extension 2) you can't remember the choice because that works on mime types, and two octet-stream files may well be a quite different file type

Darin Fisher

Comment 27

•

19 years ago

> Like the fact that if a firewall only allows through types A, B, and C and we > change type A to a dangerous type D behind its back (here D == > application/msword or text/html or any of a bunch of other things) then we > become an attack vector for someone trying to get at the machines behind the > firewall. Isn't this an issue with IE browsers as well? If so, then what good would such a firewall be if it didn't scan the data as well as the advertized mime type? I would be surprised if sniffing isn't common in firewalls that attempt to block dangerous content types.

Boris Zbarsky [:bzbarsky]

Comment 28

•

19 years ago

> Isn't this an issue with IE browsers as well? Yes, and they've gotten a fair amount of bad press for it and they're changing the behavior in IE7 as I understand. > If so, then what good would such a firewall be if it didn't scan the data as > well as the advertized mime type? Not much either way, actually. Once you have browsers sniffing, the firewall is damned if it does and damned if it doesn't. That is, the "best-case" scenario in a situation where a browser sniffs is that the firewall has to be very aggressive about blocking and users can't see stuff they really should be able to see (eg plaintext that happens to have some angle brackets in it). If the browser behaves just a tad, the situation for users is much improved. In any case, my point is that introducing precisely the security issues from IE that people have been complaining about for years and that MS is just fixing is really not something to do likely. I don't think that the proposed RSS sniffer introduces a problem. I'm just worried about extensions. See http://dbaron.org/log/2006-01#e20060122a for more information...

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 29

•

19 years ago

a patch here could use a helper class I'll add in bug 315598

Depends on: 315598
No longer depends on: 325177

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Updated

•

19 years ago

Depends on: 325177

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 30

•

19 years ago

Attached patch patch (obsolete) (deleted) — Details — Splinter Review

ok, here's the patch. requires the patches of both dependencies. I'm not 100% sure whether my HTTP changes are correct here.

Attachment #211972 - Flags: review?(darin)

Darin Fisher

Comment 31

•

19 years ago

Comment on attachment 211972 [details] [diff] [review] patch >diff --git a/netwerk/base/src/nsBaseChannel.cpp b/netwerk/base/src/nsBaseChannel.cpp >+ const nsCOMArray<nsIContentSniffer> sniffers = >+ gIOService->GetContentSniffers(); I think you meant to do this: const nsCOMArray<nsIContentSniffer> &sniffers = ... Otherwise, I believe you are making an unnecessary copy of the array. >+ if (NS_SUCCEEDED(rv) && !newType.IsEmpty()) { >+ chan->SetContentType(newType); >+ } You should throw in a "break" statement here, right? Or, do you mean to let the content sniffers fight over what the real content type is? If there is going to be fighting like that, then should we give them a way to prioritize themselves? (See for example RegisterFilter on nsIProtocolProxyService.) >diff --git a/netwerk/build/nsNetCID.h b/netwerk/build/nsNetCID.h >+ * Note that only channels with the LOAD_CALL_CONTENT_SNIFFERS flag will call >+ * content sniffers. ... >+#define NS_CONTENT_SNIFFER_CATEGORY "net-content-sniffers" You should also say (somewhere) that channel implementations are not required to honor the LOAD_CALL_CONTENT_SNIFFERS flag. They can simply ignore it for whatever reason. >diff --git a/netwerk/protocol/http/src/nsHttpChannel.cpp b/netwerk/protocol/http/src/nsHttpChannel.cpp >+static nsresult >+NS_NewInputStreamPump(nsInputStreamPump **result, >+ nsIInputStream *stream, >+ PRInt64 streamPos = nsInt64(-1), >+ PRInt64 streamLen = nsInt64(-1), >+ PRUint32 segsize = 0, >+ PRUint32 segcount = 0, >+ PRBool closeWhenDone = PR_FALSE) >+{ >+ nsresult rv = NS_ERROR_OUT_OF_MEMORY; >+ nsRefPtr<nsInputStreamPump> pump = new nsInputStreamPump(); >+ if (pump) { >+ rv = pump->Init(stream, streamPos, streamLen, >+ segsize, segcount, closeWhenDone); >+ if (NS_SUCCEEDED(rv)) { >+ *result = nsnull; >+ pump.swap(*result); >+ } >+ } >+ return rv; >+} Maybe this method should live on nsInputStreamPump and be called by both this code as well as nsBaseChannel. Maybe call it: nsInputStreamPump::Create(... params ...) >+static void >+CallTypeSniffers(void *aClosure, const PRUint8 *aData, PRUint32 aCount) >+{ >+ nsIChannel *chan = NS_STATIC_CAST(nsIChannel*, aClosure); >+ >+ const nsCOMArray<nsIContentSniffer> sniffers = >+ gIOService->GetContentSniffers(); >+ PRUint32 length = sniffers.Count(); >+ for (PRUint32 i = 0; i < length; ++i) { >+ nsCAutoString newType; >+ nsresult rv = >+ sniffers[i]->GetMIMETypeFromContent(chan, aData, aCount, newType); >+ if (NS_SUCCEEDED(rv) && !newType.IsEmpty()) { >+ chan->SetContentType(newType); >+ } >+ } >+} We could also share this implementation. However, perhaps we should just wait and make nsHttpChannel inherit from nsBaseChannel. Yeah, that's okay. Please add a comment saying that if you want to go with duplicated code for now. r=darin

Attachment #211972 - Flags: review?(darin) → review+

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 32

•

19 years ago

Attached patch patch v2 (obsolete) (deleted) — Details — Splinter Review

not sure that my wording for the new flag is so great, suggestions accepted :)

Attachment #211972 - Attachment is obsolete: true

Attachment #213817 - Flags: superreview?(bzbarsky)

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 33

•

19 years ago

meant to comment on that... > Or, do you mean > to let the content sniffers fight over what the real content type is? That was my intention, but you're right, a priority would be good if I do that, so I added the break for now; I'll file a bug on adding priorization.

Boris Zbarsky [:bzbarsky]

Comment 34

•

19 years ago

Comment on attachment 213817 [details] [diff] [review] patch v2 >diff --git a/netwerk/build/nsNetCID.h b/netwerk/build/nsNetCID.h >+ * Not all channels may implement this load flag. See also >+ * nsIChannel::LOAD_CALL_CONTENT_SNIFFERS. We're not describing a load flag, so the first sentence here should be changed somehow. sr=bzbarsky

Attachment #213817 - Flags: superreview?(bzbarsky) → superreview+

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 35

•

19 years ago

Attached patch patch v2.1 (deleted) — Details — Splinter Review

- * Not all channels may implement this load flag. See also + * Not all channels may implement content sniffing. See also

Attachment #213817 - Attachment is obsolete: true

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 36

•

19 years ago

fixed on trunk, leaving open for branch

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 37

•

19 years ago

Attached patch patch for 1.8 branch (deleted) — Details — Splinter Review

HTTP-only

Attachment #213946 - Flags: approval-branch-1.8.1?(darin)

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 38

•

19 years ago

Comment on attachment 213946 [details] [diff] [review] patch for 1.8 branch er, I'll remove this comment before checking in: +// NOTE: This function duplicates code from nsBaseChannel. This will go away +// once HTTP uses nsBaseChannel (part of bug 312760)

Darin Fisher

Comment 39

•

19 years ago

Comment on attachment 213946 [details] [diff] [review] patch for 1.8 branch >Index: netwerk/base/src/nsInputStreamPump.cpp ... >+nsresult >+nsInputStreamPump::Create(nsInputStreamPump **result, ... >+ return rv; >+} >+ >+ >+ > struct PeekData { nit: kill some of that whitespace?

Attachment #213946 - Flags: approval-branch-1.8.1?(darin) → approval-branch-1.8.1+

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 40

•

19 years ago

fixed on the branch now too.

Status: NEW → RESOLVED

Closed: 19 years ago

Keywords: fixed1.8.1

Resolution: --- → FIXED

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Updated

•

19 years ago

Blocks: 329644

Christian :Biesinger (don't email me, ping me on IRC)

Assignee

Comment 41

•

19 years ago

filed bug 329644 on calling all content sniffers

No longer blocks: 329644

Boris Zbarsky [:bzbarsky]

Updated

•

19 years ago

Blocks: 329644

patch 19 years ago Christian :Biesinger (don't email me, ping me on IRC) (deleted), patch	darin.moz : review+	Details \| Diff \| Splinter Review
patch v2 19 years ago Christian :Biesinger (don't email me, ping me on IRC) (deleted), patch	bzbarsky : superreview+	Details \| Diff \| Splinter Review
patch v2.1 19 years ago Christian :Biesinger (don't email me, ping me on IRC) (deleted), patch		Details \| Diff \| Splinter Review
patch for 1.8 branch 19 years ago Christian :Biesinger (don't email me, ping me on IRC) (deleted), patch	darin.moz : approval-branch-1.8.1+	Details \| Diff \| Splinter Review