405011 - Text is cut off when containing unicode supplementary characters (outside BMP) with MySQL as backend

Reporter

Description

17 years ago

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en; rv:1.8.1.8) Gecko/20061201 Firefox/2.0.0.4 (Ubuntu-feisty)
Build Identifier:

I wanted to enter a text with some very special characters (cannot enter it here :)) and it was cut off. See https://bugzilla.mozilla.org/show_bug.cgi?id=404856

Reproducible: Always

maix

Reporter

Comment 1

17 years ago

(https://bugzilla.mozilla.org/show_bug.cgi?id=404856#c6)

Max Kanat-Alexander

Comment 2

17 years ago

Does it contain any (or only) Unicode characters which have to do with bidirectional text processing (BiDi)? In that case, Bugzilla is designed to strip those characters.

Otherwise, could you explain what those characters are? I don't have any font which contains them, it seems.

In any case, bug 363153 is likely to fix the problem in the current CVS HEAD of Bugzilla. Try it on http://landfill.bugzilla.org/bugzilla-tip/

OS: Linux → All

Hardware: PC → All

Max Kanat-Alexander

Updated

17 years ago

Severity: normal → trivial

Version: unspecified → 3.0.2

Simon Montagu :smontagu

Comment 3

17 years ago

(In reply to comment #2)
> In any case, bug 363153 is likely to fix the problem in the current CVS HEAD of
> Bugzilla. Try it on http://landfill.bugzilla.org/bugzilla-tip/

The problem seems to occur with any Unicode character above U+FFFF. I tried it in https://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=6136 and it's not fixed.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Max Kanat-Alexander

Comment 4

17 years ago

(In reply to comment #3)
> The problem seems to occur with any Unicode character above U+FFFF.

What Unicode characters are above U+FFFF??

Simon Montagu :smontagu

Comment 5

17 years ago

There are over 46000 Unicode characters above U+FFFF (not including Private Use Areas). See http://www.unicode.org/Public/UNIDATA/Blocks.txt

Max Kanat-Alexander

Comment 6

17 years ago

(In reply to comment #5)
> There are over 46000 Unicode characters above U+FFFF (not including Private Use
> Areas). See http://www.unicode.org/Public/UNIDATA/Blocks.txt

Thanks. :-) Looks like most of them aren't that common (mostly ancient languages and some supplementary characters), but it's still worth investigating.

Severity: trivial → minor

Max Kanat-Alexander

Updated

17 years ago

Keywords: intl

Simon Montagu :smontagu

Comment 7

17 years ago

I'm fairly sure this is a regression: AFAIR I entered some Plane 1 characters into bug 297943 comment 8 and they appeared correctly at the time.

Max Kanat-Alexander

Comment 8

17 years ago

(In reply to comment #7)
> I'm fairly sure this is a regression: AFAIR I entered some Plane 1 characters
> into bug 297943 comment 8 and they appeared correctly at the time.

It's entirely possible that this is a problem in MySQL's handling of UTF-8, in that case, or some obscure problem with the Perl Encode module (though that's much less likely).

A. Shimono [:himorin]

Comment 9

17 years ago

mkanat: is the codepage for the bmo mysql is 'utf8'?
if so, you cannot use 4-byte utf-8 at bmo. (utf8 in mysql = CESU-8)

refer http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html
> RFC 3629 describes encoding sequences that take from one to four bytes.
> Currently, MySQL support for UTF-8 does not include four-byte sequences. (An
> older standard for UTF-8 encoding is given by RFC 2279, which describes UTF-8
> sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete;
> for this reason, sequences with five and six bytes are no longer used.)

i once heard that utf8_4 will be added to mysql in the future
i don't know about pgsql :-)
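
To make the byte counts in comment 9 concrete, here is a minimal Python sketch. It shows that code points up to U+FFFF need at most three UTF-8 bytes while anything above U+FFFF needs four, and it models the symptom reported in this bug (text cut off at the first such character). The function names and the sample string are hypothetical, and the truncation helper only imitates the observed behavior; it is not MySQL or Bugzilla code.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes needed to encode one character as UTF-8."""
    return len(ch.encode("utf-8"))


def truncate_at_first_supplementary(text: str) -> str:
    """Imitate the reported symptom: keep only the text that precedes
    the first code point above U+FFFF (assumption: this mirrors what a
    three-byte-only UTF-8 column does to the stored value)."""
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return text[:index]
    return text


# U+1D11E MUSICAL SYMBOL G CLEF is a Plane 1 (supplementary) character.
sample = "before \U0001D11E after"

print(utf8_length("A"))            # 1
print(utf8_length("\u00e9"))       # 2
print(utf8_length("\u20ac"))       # 3
print(utf8_length("\U0001D11E"))   # 4
print(repr(truncate_at_first_supplementary(sample)))  # 'before '
```

Text made up only of BMP characters passes through the helper unchanged, which matches the observation in comment 3 that the problem shows up only with characters above U+FFFF.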

Max Kanat-Alexander

Comment 10

17 years ago

(In reply to comment #9)
> mkanat: is the codepage for the bmo mysql is 'utf8'?

Yes. :-) Okay, so that's our problem--right now MySQL just doesn't support it. So unfortunately that makes this bug "INVALID" (though that's not the most accurate resolution in this case--it just means that this isn't a bug we can fix, since it's a MySQL problem).

For those who wonder "why did this work before"? It's because before, we stored raw bytes in the database instead of characters, which meant that sorting was done on raw bytes instead of the correct Unicode collation. We pulled out those raw bytes, and then Perl displayed them correctly. The ability to correctly sort multi-byte or non-ASCII characters in the database is a big enough advantage to make it currently acceptable that we can't display supplementary characters anymore.

If there's any other solution for this that we can implement in Bugzilla itself, feel free to let me know, but for now this is something that needs to be fixed in MySQL or whatever database an installation is using.

Status: NEW → RESOLVED

Closed: 17 years ago

Resolution: --- → INVALID
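
Comment 10 asks whether anything could be done on the application side. One possible mitigation, sketched here in Python purely as an illustration, is to replace each character above U+FFFF with a numeric character reference before storing the text, so the stored value contains only characters that fit in three UTF-8 bytes. This is an assumption about what such a workaround could look like, not a description of what Bugzilla does; the function names are hypothetical.

```python
import re

# Any code point in the supplementary planes (U+10000..U+10FFFF).
SUPPLEMENTARY = re.compile("[\U00010000-\U0010FFFF]")
# References in the form produced by escape_supplementary() below.
REFERENCE = re.compile(r"&#x([0-9A-F]{5,6});")


def escape_supplementary(text: str) -> str:
    """Replace characters above U+FFFF with references like &#x1D11E;."""
    return SUPPLEMENTARY.sub(lambda m: "&#x%X;" % ord(m.group()), text)


def unescape_supplementary(text: str) -> str:
    """Turn references for supplementary code points back into characters."""
    def restore(match):
        code_point = int(match.group(1), 16)
        if 0x10000 <= code_point <= 0x10FFFF:
            return chr(code_point)
        return match.group()  # out of range: leave the text untouched
    return REFERENCE.sub(restore, text)


comment = "clef: \U0001D11E, euro: \u20ac"
stored = escape_supplementary(comment)
print(stored)  # clef: &#x1D11E;, euro: €

# Every character of the stored form fits in at most three UTF-8 bytes.
assert max(len(ch.encode("utf-8")) for ch in stored) <= 3
assert unescape_supplementary(stored) == comment
```

The trade-off is that the round trip is not fully transparent: text that already contains a literal sequence such as `&#x1D11E;` would be decoded on the way out, and searching or sorting in the database would see the reference rather than the character.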

patch, v1; 12 years ago; Frédéric Buclin (deleted); patch
patch, v1.1; 12 years ago; Frédéric Buclin (deleted); patch
patch, v1.2; 12 years ago; Frédéric Buclin (deleted); patch
patch, v1.3; 12 years ago; Frédéric Buclin (deleted); patch
patch, v2; 12 years ago; Frédéric Buclin (deleted); patch; gerv : review+
disable warnings in Perl < 5.13.9; 11 years ago; Frédéric Buclin (deleted); patch; gerv : review+
Comment before submission, showing emoji and text after that got truncated; 8 years ago; cincodenada (deleted); image/png