Closed
Bug 405011
Opened 17 years ago
Closed 11 years ago
Text is cut off when containing unicode supplementary characters (outside BMP) with MySQL as backend
Categories
(Bugzilla :: Database, defect)
Tracking
()
RESOLVED
FIXED
Bugzilla 5.0
People
(Reporter: maix42, Assigned: LpSolit)
References
(Blocks 1 open bug, )
Details
(Keywords: dataloss, Whiteboard: [requires MySQL 5.5.8, see comment 13])
Attachments
(3 files, 4 obsolete files)
(deleted),
patch
|
gerv
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
gerv
:
review+
|
Details | Diff | Splinter Review |
(deleted),
image/png
|
Details |
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en; rv:1.8.1.8) Gecko/20061201 Firefox/2.0.0.4 (Ubuntu-feisty)
Build Identifier:
I wanted to enter a text with some very special characters (cannot enter it here :)) and it was cut off. See https://bugzilla.mozilla.org/show_bug.cgi?id=404856
Reproducible: Always
Comment 2•17 years ago
|
||
Does it contain any (or only) Unicode characters which have to do with bidirectional text processing (BiDi)? In that case, Bugzilla is designed to strip those characters.
Otherwise, could you explain what those characters are? I don't have any font which contains them, it seems.
In any case, bug 363153 is likely to fix the problem in the current CVS HEAD of Bugzilla. Try it on http://landfill.bugzilla.org/bugzilla-tip/
OS: Linux → All
Hardware: PC → All
Updated•17 years ago
|
Severity: normal → trivial
Version: unspecified → 3.0.2
Comment 3•17 years ago
|
||
(In reply to comment #2)
> In any case, bug 363153 is likely to fix the problem in the current CVS HEAD of
> Bugzilla. Try it on http://landfill.bugzilla.org/bugzilla-tip/
The problem seems to occur with any Unicode character above U+FFFF.
I tried it in https://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=6136 and it's not fixed.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 4•17 years ago
|
||
(In reply to comment #3)
> The problem seems to occur with any Unicode character above U+FFFF.
What Unicode characters are above U+FFFF??
Comment 5•17 years ago
|
||
There are over 46000 Unicode characters above U+FFFF (not including Private Use Areas). See http://www.unicode.org/Public/UNIDATA/Blocks.txt
Comment 6•17 years ago
|
||
(In reply to comment #5)
> There are over 46000 Unicode characters above U+FFFF (not including Private Use
> Areas). See http://www.unicode.org/Public/UNIDATA/Blocks.txt
Thanks. :-) Looks like most of them aren't that common (mostly ancient languages and some supplementary characters), but it's still worth investigating.
Severity: trivial → minor
Comment 7•17 years ago
|
||
I'm fairly sure this is a regression: AFAIR I entered some Plane 1 characters into bug 297943 comment 8 and they appeared correctly at the time.
Comment 8•17 years ago
|
||
(In reply to comment #7)
> I'm fairly sure this is a regression: AFAIR I entered some Plane 1 characters
> into bug 297943 comment 8 and they appeared correctly at the time.
It's entirely possible that this is a problem in MySQL's handling of UTF-8, in that case, or some obscure problem with the Perl Encode module (though that's much less likely).
Comment 9•17 years ago
|
||
mkanat: is the codepage for the bmo mysql is 'utf8'?
if so, you cannot use 4-byte utf-8 at bmo. (utf8 in mysql = CESU-8)
refer http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html
> RFC 3629 describes encoding sequences that take from one to four bytes.
> Currently, MySQL support for UTF-8 does not include four-byte sequences. (An
> older standard for UTF-8 encoding is given by RFC 2279, which describes UTF-8
> sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete;
> for this reason, sequences with five and six bytes are no longer used.)
i once heard that utf8_4 will be added to mysql in the future
i don't know about pgsql :-)
Comment 10•17 years ago
|
||
(In reply to comment #9)
> mkanat: is the codepage for the bmo mysql is 'utf8'?
Yes. :-) Okay, so that's our problem--right now MySQL just doesn't support it. So unfortunately that makes this bug "INVALID" (though that's not the most accurate resolution in this case--it just means that this isn't a bug we can fix, since it's a MySQL problem).
For those who wonder "why did this work before"? It's because before, we stored raw bytes in the database instead of characters, which meant that sorting was done on raw bytes instead of the correct Unicode collation. We pulled out those raw bytes, and then Perl displayed them correctly.
The ability to correctly sort multi-byte or non-ASCII characters in the database is a big enough advantage to make it currently acceptable that we can't display supplementary characters anymore.
If there's any other solution for this that we can implement in Bugzilla itself, feel free to let me know, but for now this is something that needs to be fixed in MySQL or whatever database an installation is using.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → INVALID
Assignee | ||
Comment 13•13 years ago
|
||
I'm reopening this bug because MySQL 5.5.3 and above now support 4 bytes UTF8 characters, instead of 3 bytes characters only:
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html
This is done by using the utf8mb4 character set instead of utf8. As the conversion has some consequences on stored data, this page must be read before a decision is made:
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html
As said above, utf8mb4 requires MySQL 5.5.3, and the first stable release (GA) is 5.5.8, released on Dec 3, 2010. So if we follow this path, we would have to require 5.5.8 instead of the current 5.0.15 release. Not something we can do for 4.2. I'm also not sure that's a good idea to require MySQL 5.5.8 for Bugzilla 4.4. Maybe for the next major release (4.6).
If we don't want utf8mb4 or any other newer character set (utf16 or utf32), then the bug should be closed as WONTFIX.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Assignee | ||
Updated•13 years ago
|
Assignee: create-and-change → database
Component: Creating/Changing Bugs → Database
Keywords: intl
Summary: Text is cut off when containing supplementary characters (outside BMP) → Text is cut off when containing supplementary characters (outside BMP) with MySQL as backend
Comment 14•13 years ago
|
||
The biggest problem would be:
"InnoDB has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters."
Comment 15•12 years ago
|
||
This bug should get higher priority now. Unicode 6.0 added emoji characters, which are hugely popular in some parts of the world, and most of them are supplementary characters in Unicode. We need to be able to track bugs involving them. Also, we're working on solid support for supplementary characters in ECMAScript 6, and need to be able to track issues related to that. See
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html
https://bugs.ecmascript.org/show_bug.cgi?id=610
Severity: minor → normal
Assignee | ||
Updated•12 years ago
|
Assignee | ||
Comment 18•12 years ago
|
||
(In reply to Max Kanat-Alexander from comment #14)
> The biggest problem would be:
>
> "InnoDB has a maximum index length of 767 bytes, so for utf8 or utf8mb4
> columns, you can index a maximum of 255 or 191 characters, respectively.
We don't need to enable 4 bytes Unicode characters for all fields. All duplicates are about adding such characters in comments only. I think all we need is to only use utf8mb4 for the longdescs.thetext, bugs_fulltext.comments and bugs_fulltext.comments_noprivate DB columns. longdescs.thetext has no index and so the new limitation you mention is not an issue. About the bugs_fulltext table, its type is MyISAM and columns are much larger than 767 bytes anyway.
Status: REOPENED → NEW
Comment 19•12 years ago
|
||
If fully supporting non-BMP code points is a bunch of work, I'd suggest that there would be value in implementing a short-term workaround to strip such characters from comments. Part of my surprise in bug 808241 was such characters *truncate* the comments (vs simply disappearing or being garbled).
Updated•12 years ago
|
Summary: Text is cut off when containing supplementary characters (outside BMP) with MySQL as backend → Text is cut off when containing unicode supplementary characters (outside BMP) with MySQL as backend
Assignee | ||
Updated•12 years ago
|
Whiteboard: [requires MySQL 5.5.8, see comment 13]
Assignee | ||
Comment 21•12 years ago
|
||
Comment 22•12 years ago
|
||
Comment on attachment 734233 [details] [diff] [review]
patch, v1
Review of attachment 734233 [details] [diff] [review]:
-----------------------------------------------------------------
::: Bugzilla/Comment.pm
@@ +305,5 @@
> + # without any problem. So we need to remove these characters if we use MySQL,
> + # else the comment is truncated.
> + # XXX - Once we use utf8mb4 for comments, this restriction for MySQL can go away.
> + if (Bugzilla->dbh->isa('Bugzilla::DB::Mysql')) {
> + $thetext =~ s/[\x{10000}-\x{10FFFF}]/?/g;
This proposed fix destroys information that the bug reporter wants to entrust bugzilla, and which may well be relevant to the bug being reported (see, for example, bug 857438).
The fix should apply a transformation that preserves the information, and restore the original character when retrieving it from the database. One possibility might be to use CESU-8, that is, represent the character by first encoding it in UTF-16, then mapping the two UTF-16 code units separately to UTF-8.
Or, of course, use a database that's not broken.
Attachment #734233 -
Flags: review-
Assignee | ||
Comment 23•12 years ago
|
||
Comment on attachment 734233 [details] [diff] [review]
patch, v1
You are not a Bugzilla reviewer.
Moreover, we cannot use another character set unless we require MySQL 5.5.8 or newer, and as I said in comment 13, we currently support MySQL 5.0 too. Using another character set is something which can only be done in Bugzilla 5.0 at the earliest. This will be done in a separate bug.
Attachment #734233 -
Flags: review-
Comment 24•12 years ago
|
||
I think this is a fair trade off result to replace only >FFFF chars.
Just a comment, shall we check with Bugzilla->localconfig or dbh as in the patch?
Comment 25•12 years ago
|
||
So EITHER replace the character with something that at least shows the bug reporter's intent, I don't paticularly care whether that is CESU-8 or simply U+HEXVAL, or block submission until the offending character is removed (with a sensible explanation, please).
Replacing the character with a question mark destroys information. The original problem would have been rendered as unintelligible by this change as it has been by truncation.
Assignee | ||
Comment 26•12 years ago
|
||
(In reply to shimono +sbo from comment #24)
> Just a comment, shall we check with Bugzilla->localconfig or dbh as in the
> patch?
dbh, because it calls Bugzilla->localconfig for us but also does extra checks.
Comment 27•12 years ago
|
||
Oh yes: if you really really think you need to replace the damn character, then *please* use U+FFFD, not a question mark.
Assignee | ||
Comment 28•12 years ago
|
||
(In reply to Matthias Urlichs from comment #25)
> So EITHER replace the character with something that at least shows the bug
> reporter's intent, I don't paticularly care whether that is CESU-8 or simply
> U+HEXVAL, or block submission until the offending character is removed (with
> a sensible explanation, please).
Blocking submission is not an option. Replacing the character by U+HEXVAL might be an option.
> Replacing the character with a question mark destroys information. The
> original problem would have been rendered as unintelligible by this change
> as it has been by truncation.
That's not true. Replacing one offending character by ? is much better than loosing a whole part of the comment.
Comment 29•12 years ago
|
||
As a fallback until the characters can be properly preserved, would it work to define a function such as
sub hexentity() {
return "&#x" . sprintf("%04x", ord(shift)) . ";";
}
and then use
$thetext =~ s/([\x{10000}-\x{10FFFF}])/&hexentity($1)/eg;
?
(The idea here is that if someone wants to see the actual characters, they'll be able to easily copy the text from the comment into an HTML doc.)
Comment 30•12 years ago
|
||
Hi,
>
> > Replacing the character with a question mark destroys information. The
> > original problem would have been rendered as unintelligible by this change
> > as it has been by truncation.
>
> That's not true. Replacing one offending character by ? is much better than
> loosing a whole part of the comment.
My comment was specific to the original bug report and was not inended to be generic.
"Schrödinger's ?" does not evoke any association of a character set problem.
"Schrödinger's �" certainly does.
Comment 31•12 years ago
|
||
Hi,
> return "&#x" . sprintf("%04x", ord(shift)) . ";";
How about simply
> return sprintf("&#x%04x", ord(shift)) . ";";
instead? ;-)
Comment 32•12 years ago
|
||
Sure, why not? Though surely you meant
return sprintf("&#x%04x;", ord(shift));
:)
Assignee | ||
Comment 33•12 years ago
|
||
I convert characters above U+FFFF into their U+..... hexadecimal format.
Attachment #734233 -
Attachment is obsolete: true
Attachment #734233 -
Flags: review?(dkl)
Attachment #734370 -
Flags: review?(dkl)
Comment 34•12 years ago
|
||
Comment on attachment 734370 [details] [diff] [review]
patch, v1.1
Review of attachment 734370 [details] [diff] [review]:
-----------------------------------------------------------------
::: Bugzilla/Comment.pm
@@ +305,5 @@
> + # without any problem. So we need to replace these characters if we use MySQL,
> + # else the comment is truncated.
> + # XXX - Once we use utf8mb4 for comments, this hack for MySQL can go away.
> + if (Bugzilla->dbh->isa('Bugzilla::DB::Mysql')) {
> + $thetext =~ s/([\x{10000}-\x{10FFFF}])/uc sprintf('U+%04x', ord($1))/eg;
Does this actually work correctly? In the case where there are multiple characters to be replaced? At least when testing an expression like this locally (in perl v5.12.4), I don't think the use of $1 here gives the right result.
Secondly, replacing with U+XXXX will give ugly results if two such characters occur in succession, or indeed if they're adjacent to anything except whitespace. The &#xXXXX; form would be much clearer, IMO, as it includes a visible terminator for the hex chars.
Assignee | ||
Comment 35•12 years ago
|
||
(In reply to Jonathan Kew (:jfkthame) from comment #34)
> Does this actually work correctly? In the case where there are multiple
> characters to be replaced?
Yes, it works correctly, even with multiple characters to replace.
> Secondly, replacing with U+XXXX will give ugly results if two such
> characters occur in succession
I don't really care about this problem. And it's IMO much clearer than the &#xXXXX; notation.
Comment 36•12 years ago
|
||
(In reply to Frédéric Buclin from comment #35)
> > Secondly, replacing with U+XXXX will give ugly results if two such
> > characters occur in succession
>
> I don't really care about this problem. And it's IMO much clearer than the
> &#xXXXX; notation.
The times I've run into this bug have been when I've tried to post a comment that includes sample text using supplementary-plane characters.
I'd argue that 𐐧𐑌𐐮𐐿𐐬𐐼 is a much more usable result than U+10427U+1044CU+1042EU+1043FU+1042CU+1043C for an attempt to include a Deseret word in a comment.
Assignee | ||
Comment 37•12 years ago
|
||
(In reply to Jonathan Kew (:jfkthame) from comment #36)
> I'd argue that 𐐧𐑌𐐮𐐿𐐬𐐼 is a
> much more usable result than U+10427U+1044CU+1042EU+1043FU+1042CU+1043C for
> an attempt to include a Deseret word in a comment.
Your example uses HTML entities. But you can view comments in text, html, xml or any other format. I see no reason to use something which is HTML-specific, though I understand your point.
Comment 38•12 years ago
|
||
At the very least, there should probably be some delimiter surrounding the substitute. "[U+10427][U+1044C][U+1042E][U+1043F][U+1042C][U+1043C]" is much better than "U+10427U+1044CU+1042EU+1043FU+1042CU+1043C".
But I think U+FFFE is a more "appropriate" solution, despite the loss of data: Replacing a single character with multiple characters could cause other problems.
Comment 39•12 years ago
|
||
(In reply to Gordon P. Hemsley [:gphemsley] from comment #38)
> U+FFFE
Sorry, I meant U+FFFD, the replacement character.
Assignee | ||
Comment 40•12 years ago
|
||
(In reply to Gordon P. Hemsley [:gphemsley] from comment #38)
> At the very least, there should probably be some delimiter surrounding the
> substitute. "[U+10427][U+1044C][U+1042E][U+1043F][U+1042C][U+1043C]" is much
> better than "U+10427U+1044CU+1042EU+1043FU+1042CU+1043C".
I like this idea! :)
Attachment #734370 -
Attachment is obsolete: true
Attachment #734370 -
Flags: review?(dkl)
Attachment #734414 -
Flags: review?(dkl)
Comment 41•12 years ago
|
||
Should we file a separate bug to either a) Have Bugzilla require MySQL 5.8.8 and use utf8mb4 for comments, or b) have an option to do so, which bugzilla.mozilla.org can enable? I note that the release notes say that performance is/was not as good for utf8mb4; perhaps we should consult some Mozilla DBAs to ask what the current situation is.
Another option is to find a way of storing the character in escaped form and unescaping it once it's been extracted from the DB. This would mean you couldn't as easily search for all comments containing emoji, but it would mean we could make the emoji show up correctly in a web browser or other Bugzilla client.
Gerv
Assignee | ||
Comment 42•12 years ago
|
||
(In reply to Gervase Markham [:gerv] from comment #41)
> Should we file a separate bug to either a) Have Bugzilla require MySQL 5.8.8
> and use utf8mb4 for comments
That's the plan, yes (5.5.8, not 5.8.8). But this will only land in a future major release (5.x) as a DB schema change is not acceptable for stable branches. This bug is about comments being truncated, and so you must consider my patch as a workaround, not as the best solution, which is to use utf8mb4.
> or b) have an option to do so, which bugzilla.mozilla.org can enable?
No need. bmo will probably backport the patch.
> I note that the release notes say that
> performance is/was not as good for utf8mb4; perhaps we should consult some
> Mozilla DBAs to ask what the current situation is.
I agree. That's why we will need some time to write the patch and test it.
> Another option is to find a way of storing the character in escaped form and
> unescaping it once it's been extracted from the DB.
We could easily detect strings of the form [U+....] and convert them back when displaying comments. But this means this hack should stay forever, even after the move to utf8mb4, to correctly render old comments. That's a bit annoying. But that's doable. I will give it a try.
Comment 43•12 years ago
|
||
(In reply to Frédéric Buclin from comment #42)
> > Another option is to find a way of storing the character in escaped form and
> > unescaping it once it's been extracted from the DB.
>
> We could easily detect strings of the form [U+....] and convert them back
> when displaying comments. But this means this hack should stay forever, even
> after the move to utf8mb4, to correctly render old comments. That's a bit
> annoying. But that's doable. I will give it a try.
We don't want to do that, because those strings are not unique; we may find people have talked about particular Unicode characters in bugs and their references to them are suddenly replaced with the character itself!
I think we either need to come up with an escaping system where it's not possible to type an escape sequence (which might be a bit tricky) or we should just go with the hack until we can change the charset. After all, I suspect there are few Bugzillas using these chars at the moment.
One way might be to use a special marker char before the [U+XXXX] thing - one from the private use area, if there is a PUA in the BMP. We can then filter them out on comment display.
Gerv
Assignee | ||
Comment 44•12 years ago
|
||
(In reply to Gervase Markham [:gerv] from comment #43)
> We don't want to do that, because those strings are not unique; we may find
> people have talked about particular Unicode characters in bugs and their
> references to them are suddenly replaced with the character itself!
It's all about compromise here. I don't want to insert special escape character, because they will stay forever. And if someone uses the exact same syntax as the one I proposed, well... no luck!
Assignee | ||
Comment 45•12 years ago
|
||
The [U+....] notation is now invisible to the user. That's the format used to store the comment into the DB, but they are converted back on the fly when displayed to the user.
Attachment #734414 -
Attachment is obsolete: true
Attachment #734414 -
Flags: review?(dkl)
Attachment #734542 -
Flags: review?(dkl)
Comment 46•12 years ago
|
||
I searched the DB, and no bugs on bugzilla.mozilla.org and there are no occurrences of the string
[U+XXXX]
where the Xs are all digits. So that's good.
Gerv
Comment 47•12 years ago
|
||
(In reply to Gervase Markham [:gerv] from comment #46)
> I searched the DB, and no bugs on bugzilla.mozilla.org and there are no
> occurrences of the string
>
> [U+XXXX]
>
> where the Xs are all digits. So that's good.
Did you check for "digits" or "hex digits"?
If the latter, you ought to have found at least -this- bug; see comments 38 and 40, which (ironically) will no longer display correctly if this change lands.
Comment 48•12 years ago
|
||
Indeed, I think your search must have been faulty - see for example bug 500293 comment 0, which mentions [U+2044].
There are also several examples in bug 446112 (note that the discussion there will become virtually impossible to understand if they're "translated" to the literal Unicode characters concerned) and in bug 158835.
Assignee | ||
Comment 49•12 years ago
|
||
(In reply to Jonathan Kew (:jfkthame) from comment #48)
> Indeed, I think your search must have been faulty - see for example bug
> 500293 comment 0, which mentions [U+2044].
>
> There are also several examples in bug 446112 (note that the discussion
> there will become virtually impossible to understand if they're "translated"
> to the literal Unicode characters concerned) and in bug 158835.
People should first read how the patch works before complaining. Only characters above U+FFFF will be converted. All the examples you gave are below FFFF and will be left unaltered.
Comment 50•12 years ago
|
||
Sorry, I was looking at Gerv's comment re his search, which mentioned [U+XXXX]. OK, so that greatly reduces the potential impact. Quite likely only the discussion here would be affected, though I don't have much confidence in my bugzilla-searching skills.
Comment 51•12 years ago
|
||
>> One way might be to use a special marker char before the [U+XXXX] thing - one from the private use area, if there is a PUA in the BMP. We can then filter them out on comment display.
If we go that way, we might as well just use the CESU-8 method.
I think it's a much better workaround than using something that might show up in a legitimate comment.
Can somebody please check whether CESU-8 would make it through older MySQL unaltered?
Assignee | ||
Comment 52•12 years ago
|
||
(In reply to Matthias Urlichs from comment #51)
> Can somebody please check whether CESU-8 would make it through older MySQL
> unaltered?
Per http://dev.mysql.com/worklog/task/?id=1213:
"MySQL will not support CESU-8 or Java Modified UTF-8."
Comment 53•12 years ago
|
||
"Does not support" can mean "mangles beyond recognition" (like the way it treats characters >BMP right now :-( ), or just "passes through but won't sort or search correctly" (which in turn can mean 'not case-independent' or 'not at all').
Comment 54•12 years ago
|
||
Having experimented, it turns out the MySQL regular expression engine doesn't support "\d". Which is why my search failed. A new search for
\[U\+[0-9a-fA-F]+\]
turns up only this bug.
Gerv
Assignee | ||
Comment 55•12 years ago
|
||
Comment 56•12 years ago
|
||
<sigh> You are quite right. I limited it to the Bugzilla product to test my regexp and forgot to remove the restriction.
So I think the "escape sequence" should be a bit less common, e.g.
<[U+XXXX]>
That has no hits.
Gerv
Assignee | ||
Comment 57•12 years ago
|
||
(In reply to Gervase Markham [:gerv] from comment #56)
> So I think the "escape sequence" should be a bit less common, e.g.
>
> <[U+XXXX]>
>
> That has no hits.
No! First of all, you are Mozilla-centric. Because bmo does or doesn't do something, that's fine and is extendable to all other Bugzilla installations. Secondly, there is no matches for [U+XXXXX] nor [U+XXXXXX]. All the bugs returned by the URL in comment 55 are only for code points below U+FFFF, which my patch ignores.
Comment 58•12 years ago
|
||
And have you checked "all other Bugzilla installations" to make sure people have -not- used [U+XXXXX] to refer to supplementary-plane codepoints?
A safer and more reliable option would be to "mark" these sequences with noncharacter code points from the U+FDD0..U+FDEF area, which by definition are not supposed to be interchanged; they are "permanently reserved ... for internal use". E.g. replace supplementary-plane characters with a sequence such as \x{FDD0}[U+XXXXX]\x{FDD1}.
(The visible delimiters "[U+" and "]", though not strictly needed, are included here for better readability in the case that anyone needs to examine the raw data as actually stored in the db; the \x{FDD0} and \x{FDD1} delimiters would most likely be either invisible or rendered as "missing-character" boxes in most environments.)
See section 16.7 in http://www.unicode.org/versions/Unicode6.0.0/ch16.pdf for details of the defined noncharacter codepoints.
Comment 59•12 years ago
|
||
DBA Here. :)
justdave brought me in to clarify some things you guys may be facing that may be best answered from a data perspective.
Here's a few points.
- You cannot store supplementary characters in mysql using utf8. If you try to, you'll only store the first 3 bytes and the 4th will be dropped. That is why, in order to solve this, you will have to use the solution of going to utf8mb4 and mysql 5.5.8.
- Because of the rule above, you don't have any data translation issues to worry about when upgrading from utf8 to utf8mb4.
- The maximum length of an index (prior to MySQL 5.6.3) is 767 bytes. In 5.6.3 it becomes 3072 bytes.
- When the ROW_FORMAT = DYNAMIC (what we use in all of our databases I've seen, and is the recommended/default option)
- utf8 uses 1-3 bytes per character depending upon the character (latin characters use 1, BMP use 3, etc.)
- utf8mb4 uses 1-4 bytes per character depending upon the character...
- The warning you get about the 191 character limit in utf8mb4 is ONLY a warning. What it's telling you is this: "If the data exceeds 767 bytes, we will only store 767 bytes in the index. Oh, by the way, since you're using utf8mb4, you can potentially do that in only 191 characters."
- You get the same warning if you try and make a 256+ character varchar in utf8.
- This is only a warning. It still stores up to 767 bytes of data, always. (until 5.6.3+)
Hope that clears things up! Also, please note you can always get any questions answered by a DBA by popping over and saying hello on #db in IRC. I promise we don't bite. :)
Assignee | ||
Comment 60•12 years ago
|
||
After some discussion on IRC with other core Bugzilla developers, we agreed with the proposal made in comment 58, and so I added \x{FDD0} and \x{FDD1} as delimiters. Moreover, I updated my patch so that these characters can be searched for in queries.
Attachment #734542 -
Attachment is obsolete: true
Attachment #734542 -
Flags: review?(dkl)
Attachment #735268 -
Flags: review?(glob)
Attachment #735268 -
Flags: review?(dkl)
Comment 61•12 years ago
|
||
So, with this patch applied, the following should output as LINEAR B SYLLABLE B008 A (U+10000):
[U+10000]
Assignee | ||
Comment 62•11 years ago
|
||
Bugzilla 4.2 is now restricted to security fixes only.
Target Milestone: Bugzilla 4.2 → Bugzilla 4.4
Comment 63•11 years ago
|
||
dkl/glob: ping on this review :-)
Gerv
Updated•11 years ago
|
Assignee | ||
Updated•11 years ago
|
Target Milestone: Bugzilla 4.4 → Bugzilla 5.0
Comment 64•11 years ago
|
||
Comment on attachment 735268 [details] [diff] [review]
patch, v2
r=gerv, if you remove the XXXs, which should never be checked in.
dkl or glob: do you want to review this too? We should get this checked in and then deployed on BMO, for the reasons Norbert outlines above. BMO users want to use characters outside the Basic Multilingual Plane in Bugzilla comments.
Gerv
Attachment #735268 -
Flags: review?(dkl) → review+
Comment 65•11 years ago
|
||
Comment on attachment 735268 [details] [diff] [review]
patch, v2
i don't think this needs more than one review
Attachment #735268 -
Flags: review?(glob)
Assignee | ||
Comment 66•11 years ago
|
||
If we fix bug 868867, this fix isn't needed. justdave, what do you decide?
Flags: approval?
Comment 67•11 years ago
|
||
We have a while until enough distributions are shipping MySQL 5.5 for upgrading to it to be a viable option, so lets go ahead and do this. We can back this out on trunk (and convert all the affected data back on upgrade) in bug 868867 when that condition occurs.
Flags: approval? → approval+
Assignee | ||
Comment 68•11 years ago
|
||
(In reply to Gervase Markham [:gerv] from comment #64)
> r=gerv, if you remove the XXXs, which should never be checked in.
It's pretty common to put XXX where the code is clearly a hack. See all the places in the Bugzilla codebase where we use it. It's much easier to track hacks when we need to clean up the code. And as justdave said in comment 67, this code must be reverted once bug 868867 is fixed.
Keywords: relnote
Comment 69•11 years ago
|
||
You may argue it's too late for Bugzilla, but I'd say that normal convention is that "XXX" means: "something I should fix before checking this code in", and other annotations like "TODO" indicate: "we are living with this for now, but it should change later.
Given that we have a bug which records the future work needed, I'm not sure we also need XXX.
Gerv
Assignee | ||
Comment 70•11 years ago
|
||
Committing to: bzr+ssh://lpsolit%40gmail.com@bzr.mozilla.org/bugzilla/trunk/
modified Bugzilla/Bug.pm
modified Bugzilla/Comment.pm
modified Bugzilla/Search.pm
Committed revision 8929.
Status: ASSIGNED → RESOLVED
Closed: 17 years ago → 11 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 71•11 years ago
|
||
Perl 5.13.8 and older complain that "Unicode non-character 0xfdd0 is illegal for interchange". Same for 0xfdd1. This bug has been fixed in Perl 5.13.9:
"Unicode non-characters, some of which previously were erroneously considered illegal in places by Perl, contrary to the Unicode standard, are now always legal internally."
http://perldoc.perl.org/perlunicode.html#Non-character-code-points states that:
"The non-character code points are the 32 between U+FDD0 and U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF. Some people are under the mistaken impression that these are "illegal", but that is not true. An application or cooperating set of applications can legally use them at will internally; but these code points are "illegal for open interchange". Therefore, Perl will not accept these from input streams unless lax rules are being used, and will warn (using the warning category "nonchar", which is a sub-category of "utf8") if an attempt is made to output them."
Looks like I have to disable these warnings for 5.10 and 5.12.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 72•11 years ago
|
||
Attachment #8381730 -
Flags: review?(gerv)
Comment 73•11 years ago
|
||
(In reply to Gervase Markham [:gerv] from comment #69)
> Given that we have a bug which records the future work needed, I'm not sure
> we also need XXX.
FWIW, as a reviewer I've used a policy which I think came from JS-land / Brendan: XXX and TODO are fine if the comment mentions the bug you've filed to address the issue later. That gives future people a clear place to go for more context / discussion / fixes, and helps to encourage either fixing the problem before landing (to make the comment unnecessary), or making sure there's a bug in the _bug tracker_ so that issues don't get lost.
Comment 74•11 years ago
|
||
Comment on attachment 8381730 [details] [diff] [review]
disable warnings in Perl < 5.13.9
http://perldoc.perl.org/functions/state.html says you need a "use feature state" to use state, and grepping the current trunk doesn't show one...
In another review, you said "state" was not allowed in 4.4 and below. I presume this patch is only for trunk? If so, that's fine.
Gerv
Attachment #8381730 -
Flags: review?(gerv) → review+
Assignee | ||
Comment 76•11 years ago
|
||
(In reply to Gervase Markham [:gerv] from comment #74)
> http://perldoc.perl.org/functions/state.html says you need a "use feature
> state" to use state, and grepping the current trunk doesn't show one...
|use 5.10.1| enables all supported features, including "state".
> In another review, you said "state" was not allowed in 4.4 and below. I
> presume this patch is only for trunk? If so, that's fine.
Yes, trunk only, as shown by the target milestone of this bug. :)
Committing to: bzr+ssh://lpsolit%40gmail.com@bzr.mozilla.org/bugzilla/trunk/
modified Bugzilla/Bug.pm
modified Bugzilla/Comment.pm
modified Bugzilla/Search.pm
Committed revision 8932.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 78•11 years ago
|
||
I was just affected by this in bug 998348. I tried to enter emoji characters (https://www.dropbox.com/s/uibcwoumgir40tg/Screenshot%202014-04-18%2011.35.51.png) but the text was cut off (https://www.dropbox.com/s/lv9pk1guw7m1ec2/Screenshot%202014-04-18%2011.36.58.png).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 79•11 years ago
|
||
bmo is running 4.2.7. This bug is targetted 5.0. Upgrade bmo to 5.0! :)
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•9 years ago
|
Comment 88•8 years ago
|
||
This bug is marked RESOLVED FIXED. Is this instance of Bugzilla (as in, the one I am commenting on right now) using a version that does not have a fix? Because high Unicode characters (e.g. emoji) still seem cause comments to be truncated, rather than being replaced, as I ran into in bug 1054780 today.
For instance, I will attach a screenshot of this comment before submission, which after submission should end up being cut off right after this pizza (U+1F355):
Comment 89•8 years ago
|
||
And indeed: this is what the submit box looked like before I hit Submit, which resulted in the comment above. Text after the pizza being truncated, rather than the pizza simply being omitted or substituted.
Assignee | ||
Comment 90•8 years ago
|
||
(In reply to cincodenada from comment #88)
> This bug is marked RESOLVED FIXED. Is this instance of Bugzilla (as in, the
> one I am commenting on right now) using a version that does not have a fix?
This installation is not yet running Bugzilla 5.x, no. The upgrade progress can be seen in bug 974387.
Comment 91•8 years ago
|
||
(In reply to cincodenada from comment #88)
> This bug is marked RESOLVED FIXED. Is this instance of Bugzilla (as in, the
> one I am commenting on right now) using a version that does not have a fix?
> Because high Unicode characters (e.g. emoji) still seem cause comments to be
> truncated, rather than being replaced, as I ran into in bug 1054780 today.
>
> For instance, I will attach a screenshot of this comment before submission,
> which after submission should end up being cut off right after this pizza
> (U+1F355):
The fix we are pursuing for BMO differs from the way this was fixed in upstream Bugzilla.
Requiring newer mysql and switching utf8mb4.
The work in progress patch is https://github.com/mozilla-bteam/bmo/pull/22.
The only downside of this is indexes (but not columns!) become limited to a smaller number of characters.
Comment 92•8 years ago
|
||
Aha, thanks for the details and pointer to the relevant bug. Glad to see it's in progress here at BMO, apologies for posting on the wrong project!
You need to log in
before you can comment on or make changes to this bug.
Description
•