Closed Bug 541245 Opened 15 years ago Closed 15 years ago

UTF8 <-> UTF16 conversion/comparison code is buggy/inconsistent

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla1.9.3a2

People

(Reporter: sicking, Assigned: sicking)

References

(Depends on 1 open bug)

Details

Attachments

(5 files, 1 obsolete file)

Part1: Fix UTF8 -> UTF16 conversion code 15 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently (deleted), patch	smontagu : review+	Details \| Diff \| Splinter Review
Part2: Fix UTF8 <-> UTF16 comparison code 15 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently (deleted), patch		Details \| Diff \| Splinter Review
Part3: Add tests 15 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently (deleted), patch	smontagu : review+	Details \| Diff \| Splinter Review
Part2: Fix UTF8 <-> UTF16 comparison code 15 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently (deleted), patch	smontagu : review+	Details \| Diff \| Splinter Review
Part4: Deal with all illegal characters when converting. Not yet tested. 15 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently (deleted), patch		Details \| Diff \| Splinter Review
Part4: Make illegal char handling consistent 15 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently (deleted), patch	smontagu : review+	Details \| Diff \| Splinter Review

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Description

•

15 years ago

There's a few problems with the current UTF8/UTF16 conversion/comparison code. * Converting UTF8->UTF16 generally converts overlong UTF8 sequences to UTF16 replacement characters. This is good to avoid security issues by 'hiding' characters in an overlong encoded sequence. However CompareUTF8toUTF16 does not do the same thing. * There's a bug in the UTF8->UTF16 conversion code that makes us measure the length of the resulting UTF8 code if a 4-byte UTF8 sequence is overlong. This results in us truncating the result. * The UTF8->UTF16 conversion code does error checking by checking that the measured length is the same as the actual conversion length. It's unclear that this catches all errors. * There's no tests Patches coming up

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 1

•

15 years ago

Attached patch Part1: Fix UTF8 -> UTF16 conversion code (deleted) — Details — Splinter Review

This fixes the bug in the length calculation. It also simplifies error checking by checking the existing error flag rather than comparing lengths. Also removes some left-over code from when we had fragmented strings. Really we don't need to use character sinks at all since we don't have fragmented strings, but I'd rather do that separately.

Assignee: nobody → jonas

Attachment #422868 - Flags: review?(dbaron)

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 2

•

15 years ago

Attached patch Part2: Fix UTF8 <-> UTF16 comparison code (obsolete) (deleted) — Details — Splinter Review

Move replacement character handling into the UTF8 enumerator so that it happens in the conversion and comparison code likewise. Also remove some dead UTF8 enumerator code.

Attachment #422872 - Flags: review?(dbaron)

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 3

•

15 years ago

Attached patch Part3: Add tests (deleted) — Details — Splinter Review

Attachment #422873 - Flags: review?(dbaron)

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Updated

•

15 years ago

Blocks: 534136

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 4

•

15 years ago

Attached patch Part2: Fix UTF8 <-> UTF16 comparison code (deleted) — Details — Splinter Review

Something went wrong in the diffing. This is the full patch.

Attachment #422872 - Attachment is obsolete: true

Attachment #422893 - Flags: review?(dbaron)

Attachment #422872 - Flags: review?(dbaron)

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 5

•

15 years ago

Comment on attachment 422868 [details] [diff] [review] Part1: Fix UTF8 -> UTF16 conversion code Simon, would you be able to review these?

Attachment #422868 - Flags: review?(dbaron) → review?(smontagu)

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Updated

•

15 years ago

Attachment #422873 - Flags: review?(dbaron) → review?(smontagu)

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Updated

•

15 years ago

Attachment #422893 - Flags: review?(dbaron) → review?(smontagu)

Simon Montagu :smontagu

Comment 6

•

15 years ago

I don't think I can review these, because I'm not an XPCOM peer.

Benjamin Smedberg

Comment 7

•

15 years ago

I'm more than happy to delegate the reviews here.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 8

•

15 years ago

Just to be clear (sorry, ESL), you're happy to let Simon review? Or to review yourself?

Benjamin Smedberg

Comment 9

•

15 years ago

To let smontagu review.

Simon Montagu :smontagu

Comment 10

•

15 years ago

Comment on attachment 422868 [details] [diff] [review] Part1: Fix UTF8 -> UTF16 conversion code >+++ b/xpcom/string/public/nsUTF8Utils.h >@@ -484,28 +486,45 @@ class CalculateUTF8Length >+ // The below code therefor only adds 1 to mLength if the UTF8 >+ // data will produce a decoded character which is greater than >+ // or equal to 0x010000 and less than 0x0110000. >+ // It doesn't matter what to do the case where p + 4 > end >+ // since no UTF16 characters will be written in that case by >+ // ConvertUTF8toUTF16 >+ >+ if (p + 4 <= end) { >+ PRUint32 c = ((PRUint32)(p[0] & 0x07)) << 6 | >+ ((PRUint32)(p[1] & 0x30)); >+ if (c >= 0x010 && c < 0x110) >+ ++mLength; This bit-twiddling looks right to me, but it's obscure to say the least. Can you document it a bit more so that it's comprehensible without looking up the UTF-8 RFC?

Attachment #422868 - Flags: review?(smontagu) → review+

Simon Montagu :smontagu

Comment 11

•

15 years ago

Comment on attachment 422868 [details] [diff] [review] Part1: Fix UTF8 -> UTF16 conversion code Plus some nits to the comments: >@@ -484,28 +486,45 @@ class CalculateUTF8Length >+ // The below code therefor only adds 1 to mLength if the UTF8 "The code below therefore" >+ // data will produce a decoded character which is greater than >+ // or equal to 0x010000 and less than 0x0110000. >+ // It doesn't matter what to do the case where p + 4 > end "to do in the case"

Simon Montagu :smontagu

Comment 12

•

15 years ago

Comment on attachment 422893 [details] [diff] [review] Part2: Fix UTF8 <-> UTF16 comparison code >+++ b/xpcom/string/public/nsUTF8Utils.h >+ else if ( ucs4 >= 0xD800 && >+ (ucs4 <= 0xDFFF || ucs4 == 0xFFFE || ucs4 == 0xFFFF || >+ ucs4 >= UCS_END)) >+ { >+ // Surrogates and prohibited characters >+ ucs4 = UCS2_REPLACEMENT_CHAR; >+ } >+ Firstly, shouldn't the condition be if ((ucs4 >= 0xD800 && ucs4 <= 0xDFFF) || ucs4 == 0xFFFE || ucs4 == 0xFFFF || ucs4 >= UCS_END) ? (Even if you say that they are logically equivalent and your way is more optimized, I still think it's confusing) Secondly, the prohibited characters should include --FFFE and --FFFF in all 17 planes and also 0xFDD0 - 0xFDEF. I suggest defining IS_NONCHARACTER in nsCharTraits.h to include all of those, adding a test for IS_NONCHARACTER to IS_VALID_CHAR, and then using ENSURE_VALID_CHAR in your code.

Simon Montagu :smontagu

Updated

•

15 years ago

Attachment #422873 - Flags: review?(smontagu) → review+

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 13

•

15 years ago

(In reply to comment #12) > (From update of attachment 422893 [details] [diff] [review]) > >+++ b/xpcom/string/public/nsUTF8Utils.h > > >+ else if ( ucs4 >= 0xD800 && > >+ (ucs4 <= 0xDFFF || ucs4 == 0xFFFE || ucs4 == 0xFFFF || > >+ ucs4 >= UCS_END)) > >+ { > >+ // Surrogates and prohibited characters > >+ ucs4 = UCS2_REPLACEMENT_CHAR; > >+ } > >+ > > Firstly, shouldn't the condition be > if ((ucs4 >= 0xD800 && ucs4 <= 0xDFFF) || > ucs4 == 0xFFFE || ucs4 == 0xFFFF || ucs4 >= UCS_END) > ? (Even if you say that they are logically equivalent and your way is more > optimized, I still think it's confusing) Why is it confusing? There's no need to check if ucs4 == 0xFFFE if we know it's less than 0xD800. I do think performance is quite important here since this is a test we run for every single character during conversion. So just doing one test rather than four in the common case seems like a decent win. > Secondly, the prohibited characters should include --FFFE and --FFFF in all 17 > planes and also 0xFDD0 - 0xFDEF. I suggest defining IS_NONCHARACTER in > nsCharTraits.h to include all of those, adding a test for IS_NONCHARACTER to > IS_VALID_CHAR, and then using ENSURE_VALID_CHAR in your code. Hmm.. the old code didn't do this, but I guess we could add this if you are sure this is desired behavior.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 14

•

15 years ago

So is 0xFDD0 - 0xFDEF also illegal in all 17 planes? Or just the first one?

Simon Montagu :smontagu

Comment 15

•

15 years ago

(In reply to comment #13) > Why is it confusing? There's no need to check if ucs4 == 0xFFFE if we know it's > less than 0xD800. I do think performance is quite important here since this is > a test we run for every single character during conversion. So just doing one > test rather than four in the common case seems like a decent win. OK, fair enough. > Hmm.. the old code didn't do this, but I guess we could add this if you are > sure this is desired behavior. Yes, I think the old code is wrong and we should be checking for all noncharacters. (In reply to comment #14) > So is 0xFDD0 - 0xFDEF also illegal in all 17 planes? Or just the first one? Only in plane 0. See section 16.7 in http://www.unicode.org/versions/Unicode5.2.0/ch16.pdf

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 16

•

15 years ago

So I started coding up covering all illegal characters when converting between UTF8<->UTF16. It turns into a non trivial amount of work unfortunately since there's now a lot more cases where we need to check for illegal characters. So not only do we need to check for more values every time we check for illegal characters. We also have to check in more places since for example a UTF16 surrogate pair can now result in an invalid character. Is there a reason we're checking for these illegal characters at all while doing conversion? Currently the UTF16 -> UTF8 conversion code doesn't look for any illegal values other than broken surrogate pairs, and we don't seem to be hurting from this.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 17

•

15 years ago

Attached patch Part4: Deal with all illegal characters when converting. Not yet tested. (deleted) — Details — Splinter Review

I think this is what it'd look like, but need to write tests to make sure it's correct. Let me know if this is something we really want to do.

Attachment #425617 - Flags: review?(smontagu)

Simon Montagu :smontagu

Comment 18

•

15 years ago

Well in principle I think we should be doing either all or nothing, i.e. if we check for 0xFFFE and 0xFFFF we should check for the other non-characters as well. Also since we use these converters often to produce strings which we then pass to platform APIs, we really should be rigorous about "not exchanging non-characters between processes"

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 19

•

15 years ago

I agree about consistency, but I'd prefer to remove the 0xFFFE/0xFFFF checks. Aren't we already in the situation where we're passing noncharacters to various APIs? From javascript we get UTF16, which I'd imagine we then either pass directly to platform APIs, or convert to UTF8 and pass to platform APIs. In neither case we currently check for noncharacters since the UTF16->UTF8 conversion code doesn't do any such checks.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 20

•

15 years ago

Attached patch Part4: Make illegal char handling consistent (deleted) — Details — Splinter Review

I think this is a better way to deal with illegal characters in the conversion code. This makes us consistent between utf8->utf16 and utf16->utf8 conversion code in that neither converts illegal characters to replacement characters. If we really want to ensure that we never pass illegal utf characters to various platform APIs, I think trying to enforce that in the conversion code is the wrong strategy. Either we need to make sure that all of our strings are proper utf8/utf16, in which case we don't need any checks in the conversion code (though we could add assertions). Or we ensure that we check for illegal characters right before passing strings to platform APIs. Checking in conversion code is IMHO the worst of both worlds since we neither can rely on a string being correct (it might have never been converted), but we're still paying the cost in the conversion code.

Attachment #428354 - Flags: review?(smontagu)

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Updated

•

15 years ago

Attachment #425617 - Flags: review?(smontagu)

Simon Montagu :smontagu

Comment 21

•

15 years ago

Comment on attachment 428354 [details] [diff] [review] Part4: Make illegal char handling consistent >diff --git a/xpcom/string/public/nsUTF8Utils.h b/xpcom/string/public/nsUTF8Utils.h >--- a/xpcom/string/public/nsUTF8Utils.h >+++ b/xpcom/string/public/nsUTF8Utils.h >@@ -129,18 +129,17 @@ public: > else if ( ucs4 >= 0xD800 && >- (ucs4 <= 0xDFFF || ucs4 == 0xFFFE || ucs4 == 0xFFFF || >- ucs4 >= UCS_END)) >+ (ucs4 <= 0xDFFF || ucs4 >= UCS_END)) > { > // Surrogates and prohibited characters > ucs4 = UCS2_REPLACEMENT_CHAR; > } > "Surrogates and code points outside the Unicode range"

Attachment #428354 - Flags: review?(smontagu) → review+

Simon Montagu :smontagu

Updated

•

15 years ago

Attachment #422893 - Flags: review?(smontagu) → review+

Simon Montagu :smontagu

Comment 22

•

15 years ago

(In reply to comment #20) > If we really want to ensure that we never pass illegal utf characters to > various platform APIs, I think trying to enforce that in the conversion code is > the wrong strategy. I have argued exactly this myself in the past, so I guess I'm convinced. > > Either we need to make sure that all of our strings are proper utf8/utf16, in > which case we don't need any checks in the conversion code (though we could add > assertions). Yes, let's do this (we already have bug 456391 on UTF8 strings, and we can extend it to cover UTF16 as well).

Nomis101

Comment 23

•

15 years ago

Is it possible that this broke Thunderbird Trunk? I now get this error: /src/mailnews/base/util/nsMsgI18N.cpp:549: error: no matching function for call to ‘UTF8CharEnumerator::NextChar(const char**, const char*&)’

Mark Banner (:standard8)

Comment 24

•

15 years ago

(In reply to comment #23) > Is it possible that this broke Thunderbird Trunk? I now get this error: > /src/mailnews/base/util/nsMsgI18N.cpp:549: error: no matching function for call > to ‘UTF8CharEnumerator::NextChar(const char**, const char*&)’ Yes, I'm working on it.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 25

•

15 years ago

Checked in! http://hg.mozilla.org/mozilla-central/rev/c5520407a4ad http://hg.mozilla.org/mozilla-central/rev/8558f8ffb9b0 http://hg.mozilla.org/mozilla-central/rev/f8046860d0cf http://hg.mozilla.org/mozilla-central/rev/e65d3f9b0ede Thanks for the reviews!

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Serge Gautherie (:sgautherie)

Updated

•

15 years ago

Depends on: 549098

Serge Gautherie (:sgautherie)

Updated

•

15 years ago

Flags: in-testsuite+

Target Milestone: --- → mozilla1.9.3a2

Daniel Holbert [:dholbert]

Updated

•

15 years ago

Depends on: 552793

Nobody; OK to take it and work on it

Updated

•

4 years ago

Component: String → XPCOM

You need to log in before you can comment on or make changes to this bug.