Closed Bug 1070763 Opened 10 years ago Closed 6 years ago

XHR does not sniff the BOM before decoding responseText

Categories

(Core :: DOM: Core & HTML, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla63
Tracking Status
firefox58 - wontfix
firefox59 --- wontfix
firefox63 --- fixed

People

(Reporter: slama-h, Assigned: wisniewskit, Mentored)

References

(Blocks 1 open bug, )

Details

(Whiteboard: [lang=c++][wptsync upstream])

Attachments

(4 files, 1 obsolete file)

Attached image Bad: Galaxy_Manual_Firefox.jpg (deleted) —
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0
Build ID: 20140917194002

Steps to reproduce:

Go to URL http://downloadcenter.samsung.com/content/MC/201303/20130327211500937/EN/Eng/start_here.html (User Manual for tablet Samsung Galaxy Note 8.0). User agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0.


Actual results:

User Manual's table of contents (panel on left-hand side) displays OK but content area (large panel on right-hand side) shows raw HTML preceded by 2 question marks (0x3f). See attachment Galaxy_Manual_Firefox.jpg


Expected results:

Content area should look as rendered by Google Chrome - see attachment Galaxy_Manual_Chrome.jpg
Attached image Good: Galaxy_Manual_Chrome.jpg (deleted) —
Note 1: Not sure about Version: 9 Branch and other keywords; I'm not a developer.
Note 2: The error also occurs with Firefox for Android (32.0.1) on tablet Samsung Galaxy Note 8.0 (GT-N5110) with Android 4.4.2. Chrome for Android renders properly.
Severity: normal → minor
I can confirm this with Firefox32
Status: UNCONFIRMED → NEW
Ever confirmed: true
This is not an HTML parser bug.

The site uses jQuery's "ajax" function to retrieve the HTML as XHR's responseText. The HTML is little-endian UTF-16 with BOM.

For some reason that I haven't figured out at this time, XHR decodes the response as a rough ASCII superset instead of sniffing the BOM and decoding as UTF-16.
Component: HTML: Parser → DOM: Core & HTML
OS: Windows 7 → All
Hardware: x86_64 → All
Version: 9 Branch → unspecified
Summary: Raw HTML displayed in main content DIV instead of being rendered → XHR decodes BOMful UTF-16 responseText as non-UTF-16
Mentor: bzbarsky
Whiteboard: [lang=c++]
Hallvors, could you check if there's wpt tests for this?
Flags: needinfo?(hsteen)
Also a problem with UTF-8: see bug 1193947.
Summary: XHR decodes BOMful UTF-16 responseText as non-UTF-16 → XHR does not sniff the BOM before decoding responseText
Blocks: xhr
I have a simple patch in the works which has XHRs sniff for a BOM and change decoders accordingly, before they actually process the response.

However, it's still not letting Firefox pass one of the tests in the WPT responsetext-decoding.htm, the one named: XMLHttpRequest: responseText decoding (text/plain %C2)

I'm honestly not sure how to fix that, nor if it's even related to this specific bug. Thoughts, Anne?
(In reply to Thomas Wisniewski from comment #10)
> I have a simple patch in the works which has XHRs sniff for a BOM and change
> decoders accordingly, before they actually process the response.

Does XHR expose detail that makes the built-in morphing of mozilla::Decoder according to BOM not suffice?
Thomas, I think that might be due to XMLHttpRequest not properly signaling EOF. I thought Henri fixed that, but maybe not for all callers?
(In reply to Anne (:annevk) from comment #12)
> Thomas, I think that might be due to XMLHttpRequest not properly signaling
> EOF. I thought Henri fixed that, but maybe not for all callers?

I've made both the EOF and the BOM bugs in XHR fix*able*, but I haven't fixed the XHR callers.
(In reply to Thomas Wisniewski from comment #10)
> I'm honestly not sure how to fix that, nor if it's even related to this
> specific bug. Thoughts, Anne?

Anne is right that that's about EOF. XHR needs to signal the EOF to the decoder, and the decoder will do the right thing.
I just tried adding this to the XHR OnStopRequest handler, but no dice (the same three failures still happen on the web platform test):

>  // Signal EOF to mDecoder
>  if (mDecoder) {
>    Span<uint8_t> src; // empty span
>    nsString dst;
>    dst.SetLength(1024);
>    uint32_t result;
>    size_t read;
>    size_t written;
>    bool hadErrors;
>    Tie(result, read, written, hadErrors) = mDecoder->DecodeToUTF16(src, dst, true);
>    Unused << hadErrors;
>    MOZ_ASSERT(read == 0, "How come an empty span was read form?");
>  }

Assuming that's sufficient to send the EOF signal, then something else must be tripping up the decoder. Thoughts?
Flags: needinfo?(hsivonen)
Attached patch process_bom_in_xhrs.diff (obsolete) (deleted) — Splinter Review
I figured I might as well post my patch here just in case.
(In reply to Thomas Wisniewski from comment #15)
> I just tried adding this to the XHR OnStopRequest handler, but no dice (the
> same three failures still happen on the web platform test):
> 
> >  // Signal EOF to mDecoder
> >  if (mDecoder) {
> >    Span<uint8_t> src; // empty span
> >    nsString dst;
> >    dst.SetLength(1024);
> >    uint32_t result;
> >    size_t read;
> >    size_t written;
> >    bool hadErrors;
> >    Tie(result, read, written, hadErrors) = mDecoder->DecodeToUTF16(src, dst, true);
> >    Unused << hadErrors;
> >    MOZ_ASSERT(read == 0, "How come an empty span was read form?");
> >  }
> 
> Assuming that's sufficient to send the EOF signal, then something else must
> be tripping up the decoder. Thoughts?

That should make the decoder process the eof, but your code quote doesn't show anything done with dst afterwards. Are you saying dst doesn't end up with a REPLACEMENT CHARACTER in it on the test case that has a single non-ASCII byte labeled as UTF-8?

(In reply to Thomas Wisniewski from comment #16)
> Created attachment 8912078 [details] [diff] [review]
> process_bom_in_xhrs.diff
> 
> I figured I might as well post my patch here just in case.

The patch seems to manage BOM sniffing manually unnecessarily. That is, I don't see the manual sniffing interacting with the XML declaration or anything like that.

If you do mDecoder = encoding->NewDecoder();, mDecoder will handle sniffing for you. After mDecoder has seen 3 or more bytes, mDecoder->Encoding() is guaranteed to return the post-sniffing encoding if you need to expose it.
Flags: needinfo?(hsivonen)
Ah, thanks, I was clearly too tired last night to really think my code through. The WPT sending just a %C2 character does indeed pass when I add this code-block to OnStopRequest instead:

>  if (mDecoder) {
>    Span<uint8_t> src; // empty span
>    XMLHttpRequestStringWriterHelper helper(mResponseText);
>    for (;;) {
>      if (!helper.AddCapacity(1024)) {
>        return NS_ERROR_OUT_OF_MEMORY;
>     }
>     auto dst = MakeSpan(helper.EndOfExistingData(), 1024);
>      uint32_t result;
>      size_t read;
>      size_t written;
>      bool hadErrors;
>      Tie(result, read, written, hadErrors) =
>        mDecoder->DecodeToUTF16(src, dst, true);
>      Unused << hadErrors;
>      MOZ_ASSERT(read == 0, "How come an empty span was read from?");
>      helper.AddLength(written);
>      if (result != kOutputFull) {
>        break; 
>      }
>    } 
>  }   

Unfortunately, it still doesn't pass the case when a UTF-16BE/LE BOM is sent with no other characters (or is sent twice/doubled, like %FE%FF%FE%FF). The stream ends up getting %FD%FD instead of the BOM being stripped (or twice that if the BOM was doubled). That is, it's picking UTF-8, even if I start it off with the correct encoding.

But my patch passes these cases, so I'm not sure what's going on here. Any hints?
Flags: needinfo?(hsivonen)
(In reply to Thomas Wisniewski from comment #18)
> >     auto dst = MakeSpan(helper.EndOfExistingData(), 1024);

1024 seems excessive. mDecoder->MaxUTF16BufferLength(src.Length()) should be much less than 1024. (The argument of MaxUTF16BufferLength() is misleadingly named in Encoding.h; I should fix that.)

> Unfortunately, it still doesn't pass the case when a UTF-16BE/LE BOM is sent
> with no other characters (or is sent twice/doubled, like %FE%FF%FE%FF). The
> stream ends up getting %FD%FD instead of the BOM being stripped (or twice
> that if the BOM was doubled). That is, it's picking UTF-8, even if I start
> it off with the correct encoding.

How do you instantiate mDecoder? (Should be encoding->NewDecoder().)
Flags: needinfo?(hsivonen)
>Should be encoding->NewDecoder()

Ah, that was it, thanks! (I did try that, but it turns out there were two places where we instantiate a decoder in the XHR code, and I had only changed one of them for my testing).

I'll get a patch ready ASAP.
Actually, I decided to skip writing all of that code in favor of just calling the existing XHR function AppendToResponseText which reads from the decoder (but changing that function to send DecodeToUTF16 aLast=true when given an empty source buffer as input).

That does the trick, and makes us start passing a good number of web platform tests. Curiously, two tests in encoding/unsupported-encodings.html still don't pass, and I'm not sure why:

UTF-32be with BOM should decode as windows-1252
utf-32be with BOM should decode as windows-1252

Meanwhile, the analogous tests with no BOM or with UTF-32le *do* pass:

UTF-32be with no BOM should decode as windows-1252
utf-32be with no BOM should decode as windows-1252
UTF-32LE with BOM should decode as UTF-16LE
utf-32le with BOM should decode as UTF-16LE

I'm guessing we can figure those cases out in a follow-up bug, however.
Alright, sent my patch up for review; it seems to pass try just fine: https://treeherder.mozilla.org/#/jobs?repo=try&revision=89c3acb2372a4e6229e7e78622aa973f68d284b0
Comment on attachment 8914573 [details]
Bug 1070763 - Ensure that XHRs sniff the BOM for non-JSON responseTypes, and flush the decoder upon end-of-stream;

https://reviewboard.mozilla.org/r/185910/#review190926

r+ provided that the EOF signaling is made explicit.

::: dom/xhr/XMLHttpRequestMainThread.cpp:569
(Diff revision 1)
>    size_t written;
>    bool hadErrors;
>    Tie(result, read, written, hadErrors) = mDecoder->DecodeToUTF16(
>      AsBytes(MakeSpan(aSrcBuffer, aSrcBufferLen)),
>      MakeSpan(helper.EndOfExistingData(), destBufferLen.value()),
> -    false);
> +    aSrcBufferLen == 0);

I don't see enough assurances that Necko won't report a zero-length segment in the middle of the stream. Please add `bool aLast = false` as an argument of `AppendToResponseText` and pass `true` explicitly on EOF instead of inferring it from the input length.
Attachment #8914573 - Flags: review?(hsivonen) → review+
(In reply to Thomas Wisniewski from comment #21)
> Curiously, two tests in encoding/unsupported-encodings.html
> still don't pass, and I'm not sure why:
> 
> UTF-32be with BOM should decode as windows-1252
> utf-32be with BOM should decode as windows-1252

What do those decode as?

> I'm guessing we can figure those cases out in a follow-up bug, however.

OK.
Ah, good idea to just pass an explicit parameter. Done.

Requesting checkin.
Assignee: nobody → wisniewskit
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/4bac27faa19b
Ensure that XHRs sniff the BOM for non-JSON responseTypes, and flush the decoder upon end-of-stream; r=hsivonen
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/4bac27faa19b
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Depends on: 1405571
KWierso, I'd like to request backing this patch out, as it is causing unexpected crashes (see bug 1405571) and I can't be sure whether the fix will be trivial.
Flags: needinfo?(wkocher)
Actually, I think we did find the fix, so let's not back it out after all.
Flags: needinfo?(wkocher)
The fix seems to have broken the internet :(
Yeah, I can only apologize for the bad timing (I didn't expect this much fallout to happen while I was on PTO).

The good news is that it seems like :baku may have already found the fix to unbreak things in bug 1407573.
Backout by archaeopteryx@coole-files.de:
https://hg.mozilla.org/mozilla-central/rev/f3da88fdee4a
Backed out changeset 4bac27faa19b for crashes and broken websites. r=backout a=backout
Backed out bug 1070763 and bug 1405571 for crashes and broken websites:

bug 1070763
https://hg.mozilla.org/mozilla-central/rev/f3da88fdee4ab79a2212f8d2d1efd37fb5009b58

bug 1405571
https://hg.mozilla.org/mozilla-central/rev/32fdad6bf6f0417d9b094a13e0670ddb76a085b8

See bug 1405571 and the dependencies (broken websites).
Status: RESOLVED → REOPENED
Flags: needinfo?(wisniewskit)
Resolution: FIXED → ---
Target Milestone: mozilla58 → ---
It seems to me that a relanding should include a robust fix for the root cause of the CORS proxy causing double calls to OnStopRequest.
I just duped Bug 1405571 (crash fallout from this patch landing) against this bug. Please ensure the issues raised there are addressed in the re-landing, thanks.
Tracking this for 58 since it seems complex and I am coming across duplicates.
I don't think it's necessary to track this one, since it's an old bug that hasn't been causing any major issues that I'm aware of (the real issues began when I tried to fix this, causing bug 1405571, which has been resolved for 57).
Flags: needinfo?(wisniewskit) → needinfo?(lhenry)
OK, I'll leave it to you then. Thanks!
Flags: needinfo?(lhenry)
Priority: -- → P3
So I've deduced that the failure which caused this to be backed out in bug 1405571 is *not* caused by OnStopRequest being called multiple times, but rather because of these early returns in OnStartRequest:

This can happen when any of the following early-returns happen in OnStartRequest, because mDecoder is only cleared after

>NS_IMETHODIMP
>XMLHttpRequestMainThread::OnStartRequest(nsIRequest *request, nsISupports *ctxt)
>{ 
>  // snip for brevity
> 
>  if (request != mChannel) {
>    // Can this still happen?
>    return NS_OK;
>  }
> 
>  // Don't do anything if we have been aborted
>  if (mState == XMLHttpRequest_Binding::UNSENT) {
>    return NS_OK;
>  }
> 
>  /* Apparently, Abort() should set UNSENT.  See bug 361773.
>     XHR2 spec says this is correct. */
>  if (mFlagAborted) {
>    NS_ERROR("Ugh, still getting data on an aborted XMLHttpRequest!");
> 
>    return NS_ERROR_UNEXPECTED;
>  }
> 
>  // Don't do anything if we have timed out.
>  if (mFlagTimedOut) {
>    return NS_OK;
>  }

When you realize that mDecoder is only set later in this function (during the call to DetectCharset), the reason for the crash becomes obvious: if an XHR object is re-used after hitting one of these early returns, it will have a stale mDecoder from its previous request. And of course, re-using that decoder rightly triggers the crash.

The fix for this is trivial; just nullify mDecoder during each call to open().

Unfortunately, engineering a test for the first early-return above is beyond me, but I managed to come up with a tests for the other three (I wanted to use crashtests instead of mochitests, but have no idea how to allow a crashtest to hit the network, which is necessary to trigger these cases as data, blob and file URLs won't ever hit these cases). Alternatively, I could move them into a new web platform test instead, but I'm not sure if that's the right thing to do for crashtests.

I'm doing a try-run with my rebased patch here to see what else I may have missed: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f0df0841d7bcf8cf5e8a17c89b8ddfea2939033
Using wpt is fine.
This bug makes us look bad on http://w3c-test.org/encoding/unsupported-encodings.any.html , because the test harness uses XHR for testing BOM handling.
Henri, I could use a little help here. I'm unable to grasp why my revised patch is failing the final test in the (recently-added) WPT: http://w3c-test.org/xhr/overridemimetype-edge-cases.window.html

The failure message is:

>assert_equals: expected "\ufffd" but got "\ufffd\ufffd"

It's doing an XHR to:

>const testURL = "resources/status.py?type=" + encodeURIComponent("text/plain;charset=windows-1252") + "&content=%C2%F0";

The UTF-8 decoder is being selected as expected, and indeed only 2 chars are being sent to it in AppendToResponseText, 0xC2 and 0xF0.

But the call to UTF8Decoder->DecodeToUTF16 is converting that input into two replacement characters instead of the one the test expects. (For reference, the call is also returning hadError=true, and written=1).
Flags: needinfo?(hsivonen)
Why does the patch make XHR do its own BOM sniffing instead of letting mozilla::Decoder do it?

> The UTF-8 decoder is being selected as expected, and indeed only 2 chars are being sent to it in AppendToResponseText, 0xC2 and 0xF0.

The test looks wrong. 0xC2 and 0xF0 are both UTF-8 lead bytes. There are fewer REPLACEMENT CHARACTERs than bogus bytes if there are some UTF-8 trail bytes following a lead byte such that the bogus sequence forms a prefix of what would be a valid sequence. In this case 0xC2, 0xF0 is not a prefix of a valid sequence. 0xC2 is a prefix and 0xF0 is another prefix, so there are two REPLACEMENT CHARACTERs.
Flags: needinfo?(hsivonen) → needinfo?(twisniewski)
>Why does the patch make XHR do its own BOM sniffing instead of letting mozilla::Decoder do it?

It does... I guess you mean my commit message is too vague? (All the patch really does is change to using a non-BOM-stripping Decoder, call its DecodeToUTF16 one last time when the stream is over with aLast=true, and add a line to reset the decoder on open()).

>The test looks wrong.

Ah, ok, that makes sense to me. Anne, I'll change the test as part of this patch so it expects the correct number of REPLACEMENT CHARACTERS, unless you feel that's not the way to go?
Flags: needinfo?(twisniewski) → needinfo?(annevk)
Flags: needinfo?(hsivonen)
Sounds good to me, thanks for cleaning this up! (I suspect the test ended up being wrong here due to EOF decoding handling only being recently fixed in Firefox and not yet in other implementations.)
Flags: needinfo?(annevk)
(In reply to twisniewski from comment #52)
> >Why does the patch make XHR do its own BOM sniffing instead of letting mozilla::Decoder do it?
> 
> It does... I guess you mean my commit message is too vague?

I was looking at the wrong attachment. I was looking at attachment 8912078 [details] [diff] [review].
Flags: needinfo?(hsivonen)
Attachment #8912078 - Attachment is obsolete: true
Alright, I've revised the WPT in this version of the patch. A try run seems fine: https://treeherder.mozilla.org/#/jobs?repo=try&revision=63a0269304f14a831e785d8cbaecd5841dac4e2a

That said, I could not push a fresh review to MozReview (no permission to re-open the review there), so I decided to try this fancy new Phabricator thing we have going on now.
Comment on attachment 8999002 [details]
Bug 1070763 - Ensure that XHRs sniff the BOM for non-JSON responseTypes, and flush the decoder upon end-of-stream; r?hsivonen

Henri Sivonen (:hsivonen) has approved the revision.
Attachment #8999002 - Flags: review+
Thanks Henri! Requesting check-in. Sheriffs, please note that the Phabricator patch is the one intended to land; there was an earlier MozReview one which is now stale, and had been backed out.
Keywords: checkin-needed
Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8b97d3b93472
Ensure that XHRs sniff the BOM for non-JSON responseTypes, and flush the decoder upon end-of-stream; r=hsivonen
Keywords: checkin-needed
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/12411 for changes under testing/web-platform/tests
Whiteboard: [lang=c++] → [lang=c++][wptsync upstream]
Whiteboard: [lang=c++][wptsync upstream] → [lang=c++][wptsync upstream error]
Whiteboard: [lang=c++][wptsync upstream error] → [lang=c++][wptsync upstream]
Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
https://hg.mozilla.org/mozilla-central/rev/8b97d3b93472
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Depends on: 1482830
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: