Closed Bug 782412 Opened 12 years ago Closed 12 years ago

Improve UTF-16 decoders

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla20

People

(Reporter: bsurender, Assigned: emk)

References

(Blocks 1 open bug)

Details

Attachments

(6 files)

Part 1: Rename UCS2BE to UTF16 12 years ago Masatoshi Kimura [:emk] (deleted), patch	smontagu : review+	Details \| Diff \| Splinter Review
Part 2: Return NS_OK_UDEC_MOREINPUT and NS_OK_UDEC_MOREOUTPUT as appropreate 12 years ago Masatoshi Kimura [:emk] (deleted), patch	smontagu : review+	Details \| Diff \| Splinter Review
Part 3: Improve the BOM handling 12 years ago Masatoshi Kimura [:emk] (deleted), patch	smontagu : review+	Details \| Diff \| Splinter Review
Part 4: Implement inputErrorBehavior 12 years ago Masatoshi Kimura [:emk] (deleted), patch	smontagu : review+	Details \| Diff \| Splinter Review
Part 5: Remove a workaround from dom/encoding 12 years ago Masatoshi Kimura [:emk] (deleted), patch	hsivonen : review+	Details \| Diff \| Splinter Review
Folded patch for aurora 12 years ago Masatoshi Kimura [:emk] (deleted), patch	akeybl : approval-mozilla-aurora-	Details \| Diff \| Splinter Review

bsurender

Reporter

Description

•

12 years ago

The current encoder for UTF-16 code units does not support checking, validating and handling of unpaired surrogates in the incoming UTF-16 code units.

:Ms2ger (he/him; ⌚ UTC+1/+2)

Updated

•

12 years ago

Component: DOM → Internationalization

bsurender

Reporter

Updated

•

12 years ago

Blocks: 764234

Component: Internationalization → DOM

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Updated

•

12 years ago

Component: DOM → Internationalization

David Baron :dbaron:

Comment 1

•

12 years ago

The decoder (nsUCS2BEToUnicode.cpp) should be able to handle arbitrary Web content that might have unpaired surrogates in it, but shouldn't the encoders (in general) be able to assume they're being given sane content?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 2

•

12 years ago

The API Bonnie is implementing takes arbitrary stuff from JS and feeds it to the encoder so we have to do sanitization somewhere, either in the encoder or at a previous stage (and looping through the text twice seems undesirable).

David Baron :dbaron:

Comment 3

•

12 years ago

Is this really the only encoder with problems (of the ones you'll ever use, if it's clearly commented explaining what needs to be done to expand the set)?

Masatoshi Kimura [:emk]

Assignee

Comment 4

•

12 years ago

Decoders should have implemented some error handling for bug 174351.

Masatoshi Kimura [:emk]

Assignee

Updated

•

12 years ago

Blocks: encoding

Masatoshi Kimura [:emk]

Assignee

Comment 5

•

12 years ago

Attached patch Part 1: Rename UCS2BE to UTF16 (deleted) — Details — Splinter Review

Renamed misleading filenames. Also, these files contains not only BE converter but also LE and BOM-sniffing converters.

Assignee: nobody → VYV03354

Status: NEW → ASSIGNED

Attachment #680387 - Flags: review?(smontagu)

Masatoshi Kimura [:emk]

Assignee

Comment 6

•

12 years ago

Attached patch Part 2: Return NS_OK_UDEC_MOREINPUT and NS_OK_UDEC_MOREOUTPUT as appropreate (deleted) — Details — Splinter Review

This is required to support TextDecoder's "stream" option correctly.

Attachment #680388 - Flags: review?(smontagu)

Masatoshi Kimura [:emk]

Assignee

Comment 7

•

12 years ago

Attached patch Part 3: Improve the BOM handling (deleted) — Details — Splinter Review

Attachment #680389 - Flags: review?(smontagu)

Masatoshi Kimura [:emk]

Assignee

Comment 8

•

12 years ago

Attached patch Part 4: Implement inputErrorBehavior (deleted) — Details — Splinter Review

This is needed for TextDecoder's "fatal" option.

Attachment #680390 - Flags: review?(smontagu)

Masatoshi Kimura [:emk]

Assignee

Comment 9

•

12 years ago

Attached patch Part 5: Remove a workaround from dom/encoding (deleted) — Details — Splinter Review

In theory, BOM detection is not a part of decoders. But this will greatly simplify the implement of TextDecoder. And part 3 will make it easier to add an option to disable BOM detection.

Attachment #680391 - Flags: review?(hsivonen)

Masatoshi Kimura [:emk]

Assignee

Comment 10

•

12 years ago

Green on try: https://tbpl.mozilla.org/?tree=Try&rev=20f04d91e80c

Masatoshi Kimura [:emk]

Assignee

Updated

•

12 years ago

Attachment #680388 - Attachment is patch: true

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Comment 11

•

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #9) > Created attachment 680391 [details] [diff] [review] > Part 5: Remove a workaround from dom/encoding > > In theory, BOM detection is not a part of decoders. But this will greatly > simplify the implement of TextDecoder. And part 3 will make it easier to add > an option to disable BOM detection. I’m a bit confused about the purpose of this bug vs. the patches vs. other potential patches. This bug says it’s about encoders. Why do the patches change decoders? How exactly does decoder BOM handling change here? E.g. nsScriptLoader relies on the decoders to swallow the BOM.

Masatoshi Kimura [:emk]

Assignee

Comment 12

•

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #11) > I’m a bit confused about the purpose of this bug vs. the patches vs. other > potential patches. > > This bug says it’s about encoders. Why do the patches change decoders? Oh, I misread the bug title. I've morphed the bug because I already attached patches. I'll file a new bug if encoders needs to be fixed. > How exactly does decoder BOM handling change here? E.g. nsScriptLoader > relies on the decoders to swallow the BOM. UTF-16BE decoder and UTF-16LE decoder will continue to swallow the BOM. This patch will only fix their poor behaviors on some edge cases. For example, If only one byte was fed to the decoder, it returned NS_ERROR_ILLEGAL_INPUT. I added the own BOM handling to TextDecoder to avoid the bug. Without the workaround, TextDecoder("utf-16").decode(Uint8Array([0xFF]),{stream:true}) would throw the exception wrongly. With this patch, UTF-16 decoders will work saner so that the workaround is no longer needed. I didn't touch the BOM-sniffing decoder because it will not be used from the DOM after bug 801402 and eventually it will be removed.

Summary: Current nsIUnicodeEncoder, nsBasicEncoder, encoder implementation for UTF-16 does not support unpaired surrogates checking. → Improve UTF-16 decoders

Simon Montagu :smontagu

Updated

•

12 years ago

Attachment #680387 - Flags: review?(smontagu) → review+

Simon Montagu :smontagu

Updated

•

12 years ago

Attachment #680390 - Flags: review?(smontagu) → review+

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Comment 13

•

12 years ago

Comment on attachment 680391 [details] [diff] [review] Part 5: Remove a workaround from dom/encoding r+ on the assumption that the other patches work as described.

Attachment #680391 - Flags: review?(hsivonen) → review+

Masatoshi Kimura [:emk]

Assignee

Updated

•

12 years ago

Blocks: 814900

Simon Montagu :smontagu

Updated

•

12 years ago

Attachment #680388 - Flags: review?(smontagu) → review+

Simon Montagu :smontagu

Updated

•

12 years ago

Attachment #680389 - Flags: review?(smontagu) → review+

Masatoshi Kimura [:emk]

Assignee

Updated

•

12 years ago

Keywords: checkin-needed

Ryan VanderMeulen [:RyanVM]

Comment 14

•

12 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/d9efe7e7b135 https://hg.mozilla.org/integration/mozilla-inbound/rev/3a77b9740c42 https://hg.mozilla.org/integration/mozilla-inbound/rev/c7cc475cbc41 https://hg.mozilla.org/integration/mozilla-inbound/rev/6d5d486cc044 https://hg.mozilla.org/integration/mozilla-inbound/rev/b960d3ce65bc

Flags: in-testsuite+

Keywords: checkin-needed

Ed Morley [:emorley]

Comment 15

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/d9efe7e7b135 https://hg.mozilla.org/mozilla-central/rev/3a77b9740c42 https://hg.mozilla.org/mozilla-central/rev/c7cc475cbc41 https://hg.mozilla.org/mozilla-central/rev/6d5d486cc044 https://hg.mozilla.org/mozilla-central/rev/b960d3ce65bc

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla20

Masatoshi Kimura [:emk]

Assignee

Comment 16

•

12 years ago

Attached patch Folded patch for aurora (deleted) — Details — Splinter Review

[Approval Request Comment] Bug caused by (feature/regressing bug #): bug 801402 User impact if declined: users will not be able to open Web pages encoded in UTF-16 big endian. Testing completed (on m-c, etc.): baked on m-c for one weeks. Risk to taking this patch (and alternatives if risky): moderately. If this is too risky, take attachment 684919 [details] [diff] [review] from bug 814900 instead. String or UUID changes made by this patch: none.

Attachment #688972 - Flags: approval-mozilla-aurora?

Lukas Blakk [:lsblakk] use ?needinfo

Comment 17

•

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #16) > Created attachment 688972 [details] [diff] [review] > Folded patch for aurora > > [Approval Request Comment] > Bug caused by (feature/regressing bug #): bug 801402 > User impact if declined: users will not be able to open Web pages encoded in > UTF-16 big endian. > Testing completed (on m-c, etc.): baked on m-c for one weeks. > Risk to taking this patch (and alternatives if risky): moderately. If this > is too risky, take attachment 684919 [details] [diff] [review] from bug > 814900 instead. > String or UUID changes made by this patch: none. What kind of risk are we looking at here? What sort of fallout would we see, could we potentially test for and catch early? Anything we can point QA to in order to be more sure of taking this fix over the one in but 814900? What is the difference between the two patches?

Masatoshi Kimura [:emk]

Assignee

Comment 18

•

12 years ago

(In reply to Lukas Blakk [:lsblakk] from comment #17) > What kind of risk are we looking at here? What sort of fallout would we > see, could we potentially test for and catch early? Anything we can point > QA to in order to be more sure of taking this fix over the one in but > 814900? What is the difference between the two patches? This patch rewritten UTF-16 decoder substantially. There might be a undiscovered problem in the new code. Second patch just changes back the canonical name of UTF-16 to UTF-16 (rather than UTF-16) which is the same setting as Firefox 18 or earlier.

Masatoshi Kimura [:emk]

Assignee

Comment 19

•

12 years ago

But I didn't find any specific concern at the moment.

Alex Keybl [:akeybl]

Comment 20

•

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #18) > (In reply to Lukas Blakk [:lsblakk] from comment #17) > > What kind of risk are we looking at here? What sort of fallout would we > > see, could we potentially test for and catch early? Anything we can point > > QA to in order to be more sure of taking this fix over the one in but > > 814900? What is the difference between the two patches? > > This patch rewritten UTF-16 decoder substantially. There might be a > undiscovered problem in the new code. Sounds like we should back out bug 801402 instead of taking this patch, since it wasn't ready to ride the trains.

Alex Keybl [:akeybl]

Updated

•

12 years ago

Attachment #688972 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-

Alex Keybl [:akeybl]

Comment 21

•

12 years ago

(In reply to Alex Keybl [:akeybl] from comment #20) > (In reply to Masatoshi Kimura [:emk] from comment #18) > > (In reply to Lukas Blakk [:lsblakk] from comment #17) > > > What kind of risk are we looking at here? What sort of fallout would we > > > see, could we potentially test for and catch early? Anything we can point > > > QA to in order to be more sure of taking this fix over the one in but > > > 814900? What is the difference between the two patches? > > > > This patch rewritten UTF-16 decoder substantially. There might be a > > undiscovered problem in the new code. > > Sounds like we should back out bug 801402 instead of taking this patch, > since it wasn't ready to ride the trains. Nevermind, we'll take bug 814900.

You need to log in before you can comment on or make changes to this bug.