Closed
Bug 363153
(bz-utf8)
Opened 18 years ago
Closed 17 years ago
Prepare Bugzilla so that string functions like substr() work with UTF-8 characters
Categories
(Bugzilla :: Database, enhancement, P1)
Tracking
()
RESOLVED
FIXED
Bugzilla 3.2
People
(Reporter: Wurblzap, Assigned: mkanat)
References
()
Details
Attachments
(1 file, 4 obsolete files)
(deleted),
patch
|
Wurblzap
:
review+
|
Details | Diff | Splinter Review |
String functions like substr() seem to work on bytes instead of on characters. This doesn't matter as long as characters are one byte each. Non-English languages frequently use multiple-byte characters, though, and these occasionally get cut in half (for example in abbreviated columns in bug lists), resulting in dummy replacement characters in display. Strings need to have Perl's Unicode flag set so that functions like substr() work on characters instead of on bytes. This means that all sources Bugzilla gets its strings from (database, environment (?), CGI, templates, code constants) need to do this. For databases, http://aspn.activestate.com/ASPN/docs/ActivePerl/5.8/lib/Pod/perlunicode.html implies that stuff needs to be done to achieve this (in the "A wrapper for fetchrow_array and fetchrow_hashref" section). Other sources may do it by default.
Assignee | ||
Comment 2•18 years ago
|
||
Okay, I looked into it more, and this could have some far-reaching implications. It's actually likely to work just fine, but the problem is that some perl modules may not understand utf8, and may "downgrade" any particular string to be not utf8. Because I don't know which modules will do this, I don't want to make this change while we're frozen. If there are modules that do this, it could cause strange bugs that take forever to track down, and are hard to fix. I don't want to have to extensively change our module requirements while we're frozen. So, we'll just have to live with this in 3.0 and wait for 3.2 to fix it.
Flags: blocking3.0+ → blocking3.0-
Priority: -- → P1
Target Milestone: Bugzilla 3.0 → Bugzilla 3.2
Assignee | ||
Updated•18 years ago
|
Assignee: general → mkanat
Assignee | ||
Comment 3•18 years ago
|
||
Okay, for reference, this can be enabled in DBD::mysql with: $dbh->{mysql_enable_utf8} = 1 And in DBD::Pg with: $dbh->{pg_enable_utf8} = 1 (We don't need to do it in Pg yet, though, since I don't think we store things there as UTF-8, yet.)
Assignee | ||
Updated•17 years ago
|
Severity: normal → enhancement
Component: Bugzilla-General → Database
Assignee | ||
Comment 7•17 years ago
|
||
Okay, this is a very simple patch, but it could have far-reaching effects, as I mentioned above. We need to test Bugzilla everywhere with multi-byte data, with this patch, particularly places that process strings or pass strings from the DB to CPAN modules. I have a test installation (on MySQL) with this patch applied, here: http://landfill.bugzilla.org/utf8/ We also should test it on PgSQL at some point.
Attachment #272961 -
Flags: review?
Assignee | ||
Comment 8•17 years ago
|
||
Fixed the Pg part. No other changes.
Attachment #272961 -
Attachment is obsolete: true
Attachment #272963 -
Flags: review?
Attachment #272961 -
Flags: review?
Assignee | ||
Comment 9•17 years ago
|
||
Comment on attachment 272963 [details] [diff] [review] v2 Something is wrong here, because I can pull a multibyte summary from the DB, but it doesn't have the utf8 flag on.
Attachment #272963 -
Flags: review? → review-
Assignee | ||
Comment 10•17 years ago
|
||
Okay, so the problem was that mysql_enable_utf8 actually requires DBD-mysql 4.0.
Attachment #272963 -
Attachment is obsolete: true
Attachment #272966 -
Flags: review?
Comment 11•17 years ago
|
||
Note that DBD::mysql 4.000 is not yet available in all distros, including on Windows. And not everybody agrees to use CPAN. So we shouldn't require such a high version as a minimum.
Comment 12•17 years ago
|
||
Comment on attachment 272966 [details] [diff] [review] v3 >Index: Bugzilla/Constants.pm >=================================================================== >- # Certain versions are broken, development versions are >- # always disallowed. >- blacklist => ['^3\.000[3-6]', '_'], You shouldn't remove development versions from the blacklist so leave the '_' alone.
Comment 13•17 years ago
|
||
Comment on attachment 272966 [details] [diff] [review] v3 This breaks product names: - Product with non-ASCII chars is shown incorrectly (non-ASCII chars are replaced with unknown char symbol). - Product with non-latin chars: "undef error - Wide character in subroutine entry at .../HTML/Scrubber.pm line 322." - Any link or function that gets a non-latin product name ends up not finding the product. Adding comment with non-latin characters throws an error: "undef error - Wide character in subroutine entry at .../Email/MIME/Encodings.pm line 26." but does add the comment. So this patch clearly makes things worse instead of fixing anything. Besides, this attribute has warning "This option is experimental and may change in future versions" so I'm not sure we should be using it in the first place. Have you verified that setting "SET CHARACTER SET" in addition to "SET NAMES" for the MySQL DB is not needed? What about "use utf8" Perl pragma, shouldn't that be enabled so that Perl deals with utf8 string correctly?
Attachment #272966 -
Flags: review? → review-
Assignee | ||
Comment 14•17 years ago
|
||
(In reply to comment #13) > (From update of attachment 272966 [details] [diff] [review]) > This breaks product names: > - Product with non-ASCII chars is shown incorrectly (non-ASCII chars are > replaced with unknown char symbol). And you're running in UTF-8 mode and you ran recode.pl and your DB is encoded correctly? > - Product with non-latin chars: "undef error - Wide character in subroutine > entry at .../HTML/Scrubber.pm line 322." Arrrgh. So perhaps HTML::Scrubber doesn't work right with utf-8 chars? It should, though...I thought we required a particular version specifically for that reason. > Have you verified that setting "SET CHARACTER SET" in addition to "SET NAMES" > for the MySQL DB is not needed? Yes. > What about "use utf8" Perl pragma, shouldn't that be enabled so that Perl deals > with utf8 string correctly? No, that was only required in Perl 5.6.
Comment 15•17 years ago
|
||
(In reply to comment #14) > And you're running in UTF-8 mode and you ran recode.pl and your DB is > encoded correctly? Yes, UTF8 mode is enabled. Data came from bugzilla-tip as is but I did run checksetup.pl that converted copied database to UTF8. Problems happened with new non-latin product I just created for this test. The non-ASCII product was already there. I should mention that without this patch, things I mentioned work again.
Comment 16•17 years ago
|
||
(In reply to comment #14) > > - Product with non-latin chars: "undef error - Wide character in subroutine > > entry at .../HTML/Scrubber.pm line 322." > > Arrrgh. So perhaps HTML::Scrubber doesn't work right with utf-8 chars? It > should, though...I thought we required a particular version specifically for > that reason. I'm not sure HTML::Scrubber is the culprit, by looking at its source code. We require HTML::Parser 3.40 or newer so that we can enable the utf8_mode parameter when running on Perl 5.8 or newer, which is always the case since Bugzilla 3.0. Here is what HTML::Parser says about this utf8_mode parameter: =item $p->utf8_mode( $bool ) Enable this option when parsing raw undecoded UTF-8. This tells the parser that the entities expanded for strings reported by C<attr>, C<@attr> and C<dtext> should be expanded as decoded UTF-8 so they end up compatible with the surrounding text. If C<utf8_mode> is enabled then it is an error to pass strings containing characters with code above 255 to the parse() method, and the parse() method will croak if you try. Example: The Unicode character "\x{2665}" is "\xE2\x99\xA5" when UTF-8 encoded. The character can also be represented by the entity "♥" or "♥". If we feed the parser: $p->parse("\xE2\x99\xA5♥"); then C<dtext> will be reported as "\xE2\x99\xA5\x{2665}" without C<utf8_mode> enabled, but as "\xE2\x99\xA5\xE2\x99\xA5" when enabled. The later string is what you want. This option is only available with perl-5.8 or better.
Assignee | ||
Updated•17 years ago
|
Alias: bz-utf8
Assignee | ||
Comment 17•17 years ago
|
||
Okay, apparently the problem is that STDOUT does not have "binmode STDOUT, ':utf8'" set appropriately. I don't know if this is something we should be expecting CGI to do for us, or if it's something we really ought to be doing, but it certainly seems to make things work right with this patch.
Assignee | ||
Comment 18•17 years ago
|
||
Okay, this version fixes the display and searching of things with utf8 in them, but the administrative pages are still broken ("no such product"), so I have to look into that.
Attachment #272966 -
Attachment is obsolete: true
Assignee | ||
Updated•17 years ago
|
Assignee | ||
Comment 19•17 years ago
|
||
Okay, here we go! This version works in every area of Bugzilla that I could test. Here are the things that had to be fixed: 1. The DB has to return utf-8 and accept utf-8 strings. 2. The url_quote filter had to be fixed to correctly quote utf-8 strings. Note that in Template-Toolkit 2.16 and above, the "uri" filter already does this for us, but we're not using that yet. 3. CGI.pm had to be fixed to correctly return utf8 strings instead of bytes. 4. We have to set "binmode STDOUT, ':utf8'" whenever we're going to output something. In this patch, we do it correctly in the CGI. checksetup.pl or other command-line scripts may need some other way to set it.
Attachment #286104 -
Attachment is obsolete: true
Attachment #286357 -
Flags: review?(wurblzap)
Assignee | ||
Updated•17 years ago
|
Attachment #286357 -
Flags: review?(wicked)
Comment 20•17 years ago
|
||
(In reply to comment #19) > 4. checksetup.pl > or other command-line scripts may need some other way to set it. Do you mean bug 382398 comment 1?
Assignee | ||
Comment 21•17 years ago
|
||
(In reply to comment #20) > Do you mean bug 382398 comment 1? Yeah, that's the right bug, thank you. :-) We'll fix that there, not here.
Reporter | ||
Comment 22•17 years ago
|
||
Comment on attachment 286357 [details] [diff] [review] v5 Great patch, both in terms how well it works and how it's done! I neither checked nor looked at email_in.pl. Pg is out of scope for me as well. >RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Constants.pm,v >- version => '2.9003', >- # Certain versions are broken, development versions are >- # always disallowed. >- blacklist => ['^3\.000[3-6]', '_'], >+ # For UTF-8 support >+ version => '4.00', Don't we want to blacklist development versions any more? If this is a glitch, then it can be fixed on checkin. >RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v > sub url_quote { > my ($toencode) = (@_); >+ utf8::encode($toencode) # The below regex works only on bytes >+ if Bugzilla->params->{'utf8'} && utf8::is_utf8($toencode); > $toencode =~ s/([^a-zA-Z0-9_\-.])/uc sprintf("%%%02x",ord($1))/eg; > return $toencode; > } This turns Insídeṛ into Ins%C3%ADde%E1%B9%9B, which appears to work. It looks strange to me, though -- please reassure me that this is legal :)
Attachment #286357 -
Flags: review?(wurblzap) → review+
Assignee | ||
Comment 23•17 years ago
|
||
(In reply to comment #22) > Don't we want to blacklist development versions any more? If this is a glitch, > then it can be fixed on checkin. Yeah, it's a glitch. I'll fix that on checkin. > This turns Insídeṛ into Ins%C3%ADde%E1%B9%9B, which appears to work. It > looks strange to me, though -- please reassure me that this is legal :) Hahaha. Yes, it is legal. :-) Those are the byte values, which we reassemble into utf-8 when they come back through CGI.pm.
Assignee | ||
Updated•17 years ago
|
Attachment #286357 -
Flags: review?(wicked)
Assignee | ||
Updated•17 years ago
|
Flags: approval+
Assignee | ||
Comment 24•17 years ago
|
||
Checking in Bugzilla.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla.pm,v <-- Bugzilla.pm new revision: 1.62; previous revision: 1.61 done Checking in email_in.pl; /cvsroot/mozilla/webtools/bugzilla/email_in.pl,v <-- email_in.pl new revision: 1.10; previous revision: 1.9 done Checking in Bugzilla/CGI.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla/CGI.pm,v <-- CGI.pm new revision: 1.34; previous revision: 1.33 done Checking in Bugzilla/Constants.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Constants.pm,v <-- Constants.pm new revision: 1.85; previous revision: 1.84 done Checking in Bugzilla/Mailer.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Mailer.pm,v <-- Mailer.pm new revision: 1.13; previous revision: 1.12 done Checking in Bugzilla/Util.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm new revision: 1.63; previous revision: 1.62 done Checking in Bugzilla/DB/Mysql.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v <-- Mysql.pm new revision: 1.55; previous revision: 1.54 done Checking in Bugzilla/DB/Pg.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Pg.pm,v <-- Pg.pm new revision: 1.25; previous revision: 1.24 done
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Comment 25•17 years ago
|
||
(In reply to comment #11) > Note that DBD::mysql 4.000 is not yet available in all distros, including on > Windows. And not everybody agrees to use CPAN. So we shouldn't require such a > high version as a minimum. Yup, not available for RHEL either. And that's one package nobody wants to screw with on production systems. :( Guess nobody will be able to use Bugzilla 3.2 or newer until RHEL6.
Assignee | ||
Comment 26•17 years ago
|
||
(In reply to comment #25) > Yup, not available for RHEL either. And that's one package nobody wants to > screw with on production systems. :( No, it's available from rpmforge as DBD-mysql (different capitalization). > Guess nobody will be able to use Bugzilla 3.2 or newer until RHEL6. We use it fine on landfill.
Assignee | ||
Comment 31•16 years ago
|
||
Added to the release notes for Bugzilla 3.2 in a patch on bug 432331.
Keywords: relnote
You need to log in
before you can comment on or make changes to this bug.
Description
•