Closed Bug 754824 Opened 13 years ago Closed 12 years ago

The highlight is off by a few characters in the search result view when some characters are UTF8 encoded on 4 bytes

Categories

(Thunderbird :: Search, defect)

defect
Not set
normal

Tracking

(thunderbird13 wontfix, thunderbird14+ fixed, thunderbird15+ fixed)

RESOLVED FIXED
Thunderbird 16.0
Tracking Status
thunderbird13 --- wontfix
thunderbird14 + fixed
thunderbird15 + fixed

People

(Reporter: florian, Assigned: florian)

References

Details

Attachments

(2 files, 1 obsolete file)

Attached image Screenshot of the bug (deleted) —
I noticed this with the <U+1F493> character (http://www.fileformat.info/info/unicode/char/1f493/index.htm). See the attached screenshot. The cause of the problem is that the current code assumes that an UTF8 character that requires more than 2 bytes is coded on 3 bytes. However, some UTF8 characters are coded on 4 bytes; they are seen by the JS code as 2 separate characters so I assumed at the time I wrote the flawed code that this case would just work, but actually it doesn't because the current code returns 3 bytes for each half of the 4 bytes character, and the character ends up counted as 6 bytes.
Blocks: 518597
Attached patch Patch (obsolete) (deleted) — Splinter Review
Assignee: nobody → florian
Attachment #623644 - Flags: review?(bugmail)
Attachment #623644 - Flags: review?(bugmail) → review+
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 15.0
Comment on attachment 623644 [details] [diff] [review] Patch [Approval Request Comment] This patch fixes a bug in a feature landed for Thunderbird 12 (bug 518597) so I think we want to take the fix to aurora, and maybe even to beta.
Attachment #623644 - Flags: approval-comm-aurora?
Attachment #623644 - Flags: approval-comm-aurora? → approval-comm-aurora+
> c >= 32768 c >= 65536
(In reply to Masatoshi Kimura [:emk] from comment #5) > > c >= 32768 > c >= 65536 Why? The character with which I noticed the bug was <U+1F493>, and it's coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.
(In reply to Florian Quèze from comment #6) > (In reply to Masatoshi Kimura [:emk] from comment #5) > > > c >= 32768 > > c >= 65536 > > Why? The character with which I noticed the bug was <U+1F493>, and it's > coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467. Ah, I forgot JS encodes characters in UTF-8. Then it should be (c >= 55296 || c <= 57343).
I'm still a bit confused by your comment. To clarify, could you give an example of a character that isn't correctly handled by the current code?
(In reply to Masatoshi Kimura [:emk] from comment #7) > it should be (c >= 55296 || c <= 57343). Right. Reopening to fix this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This time I actually looked for documentation about unicode instead of trying/guessing. http://en.wikipedia.org/wiki/UTF-16 says: "Code points U+D800 to U+DFFF The Unicode standard permanently reserves these code point values for UTF-16 encoding of the lead and trail surrogates, and they will never be assigned a character" http://en.wikipedia.org/wiki/UTF-8 says: "code points below U+0080 (which UTF-8 encodes in one byte)" "for text using only code points below U+0800 [...] each code point's UTF-8 encoding is one or two bytes" "Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16." "Invalid code points According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence"
Attachment #623644 - Attachment is obsolete: true
Attachment #637159 - Flags: review?(bugmail)
Attachment #637159 - Flags: review?(bugmail) → review+
Comment on attachment 637159 [details] [diff] [review] Follow-up to only match UTF-16 surrogate halves [Approval Request Comment] My previous broken patch landed in aurora which was Tb14 at the time, so I would like to take the follow-up to aurora and beta.
Attachment #637159 - Flags: approval-comm-beta?
Attachment #637159 - Flags: approval-comm-aurora?
Status: REOPENED → RESOLVED
Closed: 13 years ago12 years ago
Resolution: --- → FIXED
Target Milestone: Thunderbird 15.0 → Thunderbird 16.0
Attachment #637159 - Flags: approval-comm-beta?
Attachment #637159 - Flags: approval-comm-beta+
Attachment #637159 - Flags: approval-comm-aurora?
Attachment #637159 - Flags: approval-comm-aurora+
Setting status 14 to affected until we land the second patch on beta.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: