Closed
Bug 754824
Opened 13 years ago
Closed 12 years ago
The highlight is off by a few characters in the search result view when some characters are UTF8 encoded on 4 bytes
Categories
(Thunderbird :: Search, defect)
Tracking
(thunderbird13 wontfix, thunderbird14+ fixed, thunderbird15+ fixed)
RESOLVED
FIXED
Thunderbird 16.0
People
(Reporter: florian, Assigned: florian)
References
Details
Attachments
(2 files, 1 obsolete file)
(deleted), image/png
(deleted), patch | asuth: review+ | standard8: approval-comm-aurora+ | standard8: approval-comm-beta+
I noticed this with the <U+1F493> character (http://www.fileformat.info/info/unicode/char/1f493/index.htm). See the attached screenshot.
The cause of the problem is that the current code assumes that a UTF-8 character requiring more than 2 bytes is encoded on 3 bytes. However, some UTF-8 characters are encoded on 4 bytes. The JS code sees such a character as 2 separate characters, so when I wrote the flawed code I assumed this case would just work. It doesn't: the current code returns 3 bytes for each half of the 4-byte character, so the character ends up counted as 6 bytes.
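The patch itself is not reproduced in this bug, so the following is only a hypothetical reconstruction of the miscount described above. The function name `naiveUtf8Length` and the exact thresholds are assumptions made for illustration, not the actual Thunderbird code.

```javascript
// Hypothetical reconstruction (not the real Thunderbird code): count UTF-8
// bytes by looking at each UTF-16 code unit of a JS string in isolation.
function naiveUtf8Length(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    if (c < 0x80) {
      bytes += 1;
    } else if (c < 0x800) {
      bytes += 2;
    } else {
      // Flawed assumption: everything else takes 3 bytes. Each half of a
      // surrogate pair lands here, so a 4-byte character is counted as 6.
      bytes += 3;
    }
  }
  return bytes;
}

const heartNaive = "\uD83D\uDC93"; // U+1F493 as seen by JS: two code units
const counted = naiveUtf8Length(heartNaive);
const actual = new TextEncoder().encode(heartNaive).length;
console.log(counted, actual);
```

The two extra bytes per such character are what would shift the highlight past its intended position.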
Assignee | ||
Comment 1•13 years ago
|
||
Assignee: nobody → florian
Attachment #623644 -
Flags: review?(bugmail)
Updated•13 years ago
|
Attachment #623644 -
Flags: review?(bugmail) → review+
Assignee | ||
Comment 2•13 years ago
|
||
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 15.0
Assignee | ||
Comment 3•13 years ago
|
||
Comment on attachment 623644 [details] [diff] [review]
Patch
[Approval Request Comment]
This patch fixes a bug in a feature that landed in Thunderbird 12 (bug 518597), so I think we want to take the fix to aurora, and maybe even to beta.
Attachment #623644 -
Flags: approval-comm-aurora?
Updated•13 years ago
|
Attachment #623644 -
Flags: approval-comm-aurora? → approval-comm-aurora+
Assignee | ||
Comment 4•12 years ago
|
||
status-thunderbird13:
--- → wontfix
status-thunderbird14:
--- → fixed
Comment 5•12 years ago
|
||
> c >= 32768
c >= 65536
Assignee | ||
Comment 6•12 years ago
|
||
(In reply to Masatoshi Kimura [:emk] from comment #5)
> > c >= 32768
> c >= 65536
Why? The character with which I noticed the bug was <U+1F493>, and it's coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.
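As a quick check of the two values quoted above, the standard UTF-16 surrogate computation can be sketched as follows. The helper name `toSurrogatePair` is hypothetical and only serves this illustration.

```javascript
// Split a code point above U+FFFF into its UTF-16 lead and trail surrogates.
// toSurrogatePair is a hypothetical helper, not part of any library.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const lead = 0xD800 + (offset >> 10);    // top 10 bits
  const trail = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [lead, trail];
}

const pair = toSurrogatePair(0x1F493);
console.log(pair); // [55357, 56467]

// The same values are what charCodeAt reports for the JS string.
const heartString = String.fromCodePoint(0x1F493);
console.log(heartString.length, heartString.charCodeAt(0), heartString.charCodeAt(1));
```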
Comment 7•12 years ago
|
||
(In reply to Florian Quèze from comment #6)
> (In reply to Masatoshi Kimura [:emk] from comment #5)
> > > c >= 32768
> > c >= 65536
>
> Why? The character with which I noticed the bug was <U+1F493>, and it's
> coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.
Ah, I forgot JS encodes characters in UTF-16. Then it should be (c >= 55296 && c <= 57343).
Assignee | ||
Comment 8•12 years ago
|
||
I'm still a bit confused by your comment. To clarify, could you give an example of a character that isn't correctly handled by the current code?
Assignee | ||
Comment 9•12 years ago
|
||
(In reply to Masatoshi Kimura [:emk] from comment #7)
> it should be (c >= 55296 && c <= 57343).
Right. Reopening to fix this.
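For illustration, a surrogate-half test over the range 55296 to 57343 (U+D800 to U+DFFF) might look like the sketch below. Both bounds have to hold at the same time for a code unit to be inside the range. The name `isSurrogateHalf` is an assumption and is not taken from the patch.

```javascript
// Hypothetical predicate: true when a UTF-16 code unit is a lead or trail
// surrogate, i.e. lies in the reserved range U+D800..U+DFFF.
function isSurrogateHalf(c) {
  return c >= 55296 && c <= 57343;
}

console.log(isSurrogateHalf(55357)); // lead surrogate of U+1F493
console.log(isSurrogateHalf(56467)); // trail surrogate of U+1F493
console.log(isSurrogateHalf(0x20AC)); // U+20AC, an ordinary 3-byte character
```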
Status: RESOLVED → REOPENED
tracking-thunderbird14:
--- → ?
tracking-thunderbird15:
--- → ?
Resolution: FIXED → ---
Assignee | ||
Comment 10•12 years ago
|
||
This time I actually looked for documentation about Unicode instead of guessing.
http://en.wikipedia.org/wiki/UTF-16 says:
"Code points U+D800 to U+DFFF
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the lead and trail surrogates, and they will never be assigned a character"
http://en.wikipedia.org/wiki/UTF-8 says:
"code points below U+0080 (which UTF-8 encodes in one byte)"
"for text using only code points below U+0800 [...] each code point's UTF-8 encoding is one or two bytes"
"Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16."
"Invalid code points
According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence"
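Putting the quoted ranges together, a per-code-unit byte count could be sketched like this. It is a minimal sketch that assumes well-formed strings (every surrogate is part of a pair); the name `utf8Length` is hypothetical and this is not necessarily how the landed patch is written.

```javascript
// Hypothetical sketch: UTF-8 byte length of a JS string, computed one
// UTF-16 code unit at a time. Assumes surrogates always come in pairs.
function utf8Length(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    if (c < 0x80) {
      bytes += 1; // below U+0080: one byte
    } else if (c < 0x800) {
      bytes += 2; // below U+0800: two bytes
    } else if (c >= 0xD800 && c <= 0xDFFF) {
      bytes += 2; // surrogate half: half of a 4-byte sequence
    } else {
      bytes += 3; // rest of U+0800..U+FFFF: three bytes
    }
  }
  return bytes;
}

const sample = "a\u00E9\u20AC\uD83D\uDC93"; // 1 + 2 + 3 + 4 bytes
console.log(utf8Length(sample), new TextEncoder().encode(sample).length);
```

Counting 2 bytes per surrogate half keeps the loop free of look-ahead, at the cost of giving a wrong total for strings that contain an unpaired surrogate.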
Attachment #623644 -
Attachment is obsolete: true
Attachment #637159 -
Flags: review?(bugmail)
Updated•12 years ago
|
Attachment #637159 -
Flags: review?(bugmail) → review+
Assignee | ||
Comment 11•12 years ago
|
||
Comment on attachment 637159 [details] [diff] [review]
Follow-up to only match UTF-16 surrogate halves
[Approval Request Comment]
My previous broken patch landed in aurora which was Tb14 at the time, so I would like to take the follow-up to aurora and beta.
Attachment #637159 -
Flags: approval-comm-beta?
Attachment #637159 -
Flags: approval-comm-aurora?
Assignee | ||
Comment 12•12 years ago
|
||
Landed attachment 637159 [details] [diff] [review] as https://hg.mozilla.org/comm-central/rev/c2e2bef7c4ac
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED
Target Milestone: Thunderbird 15.0 → Thunderbird 16.0
Updated•12 years ago
|
Attachment #637159 -
Flags: approval-comm-beta?
Attachment #637159 -
Flags: approval-comm-beta+
Attachment #637159 -
Flags: approval-comm-aurora?
Attachment #637159 -
Flags: approval-comm-aurora+
Updated•12 years ago
|
Comment 13•12 years ago
|
||
Setting status 14 to affected until we land the second patch on beta.
Comment 14•12 years ago
|
||