Closed Bug 363153 (bz-utf8) Opened 18 years ago Closed 17 years ago

Prepare Bugzilla so that string functions like substr() work with UTF-8 characters

Categories

(Bugzilla :: Database, enhancement, P1)

2.23.3
enhancement

Tracking

()

RESOLVED FIXED
Bugzilla 3.2

People

(Reporter: Wurblzap, Assigned: mkanat)

References

()

Details

Attachments

(1 file, 4 obsolete files)

String functions like substr() seem to work on bytes instead of on characters. This doesn't matter as long as characters are one byte each. Non-English languages frequently use multiple-byte characters, though, and these occasionally get cut in half (for example in abbreviated columns in bug lists), resulting in dummy replacement characters in display.

Strings need to have Perl's Unicode flag set so that functions like substr() work on characters instead of on bytes. This means that all sources Bugzilla gets its strings from (database, environment (?), CGI, templates, code constants) need to do this. For databases, http://aspn.activestate.com/ASPN/docs/ActivePerl/5.8/lib/Pod/perlunicode.html implies that stuff needs to be done to achieve this (in the "A wrapper for fetchrow_array and fetchrow_hashref" section). Other sources may do it by default.
I'd say this is a blocker.
Flags: blocking3.0+
Okay, I looked into it more, and this could have some far-reaching implications.

It's actually likely to work just fine, but the problem is that some perl modules may not understand utf8, and may "downgrade" any particular string to be not utf8. Because I don't know which modules will do this, I don't want to make this change while we're frozen.

If there are modules that do this, it could cause strange bugs that take forever to track down, and are hard to fix. I don't want to have to extensively change our module requirements while we're frozen.

So, we'll just have to live with this in 3.0 and wait for 3.2 to fix it.
Flags: blocking3.0+ → blocking3.0-
Priority: -- → P1
Target Milestone: Bugzilla 3.0 → Bugzilla 3.2
Assignee: general → mkanat
Okay, for reference, this can be enabled in DBD::mysql with:

$dbh->{mysql_enable_utf8} = 1

And in DBD::Pg with:

$dbh->{pg_enable_utf8} = 1

(We don't need to do it in Pg yet, though, since I don't think we store things there as UTF-8, yet.)
Severity: normal → enhancement
Component: Bugzilla-General → Database
Attached patch v1 (obsolete) (deleted) — Splinter Review
Okay, this is a very simple patch, but it could have far-reaching effects, as I mentioned above.

We need to test Bugzilla everywhere with multi-byte data, with this patch, particularly places that process strings or pass strings from the DB to CPAN modules.

I have a test installation (on MySQL) with this patch applied, here:

http://landfill.bugzilla.org/utf8/

We also should test it on PgSQL at some point.
Attachment #272961 - Flags: review?
Attached patch v2 (obsolete) (deleted) — Splinter Review
Fixed the Pg part. No other changes.
Attachment #272961 - Attachment is obsolete: true
Attachment #272963 - Flags: review?
Attachment #272961 - Flags: review?
Comment on attachment 272963 [details] [diff] [review]
v2

Something is wrong here, because I can pull a multibyte summary from the DB, but it doesn't have the utf8 flag on.
Attachment #272963 - Flags: review? → review-
Attached patch v3 (obsolete) (deleted) — Splinter Review
Okay, so the problem was that mysql_enable_utf8 actually requires DBD-mysql 4.0.
Attachment #272963 - Attachment is obsolete: true
Attachment #272966 - Flags: review?
Note that DBD::mysql 4.000 is not yet available in all distros, including on Windows. And not everybody agrees to use CPAN. So we shouldn't require such a high version as a minimum.
Comment on attachment 272966 [details] [diff] [review]
v3

>Index: Bugzilla/Constants.pm
>===================================================================
>-                    # Certain versions are broken, development versions are
>-                    # always disallowed.
>-                    blacklist => ['^3\.000[3-6]', '_'],

You shouldn't remove development versions from the blacklist so leave the '_' alone.
Comment on attachment 272966 [details] [diff] [review]
v3

This breaks product names:
 - Product with non-ASCII chars is shown incorrectly (non-ASCII chars are replaced with unknown char symbol).
 - Product with non-latin chars: "undef error - Wide character in subroutine entry at .../HTML/Scrubber.pm line 322."
 - Any link or function that gets a non-latin product name ends up not finding the product.

Adding comment with non-latin characters throws an error: "undef error - Wide character in subroutine entry at .../Email/MIME/Encodings.pm line 26." but does add the comment.

So this patch clearly makes things worse instead of fixing anything.

Besides, this attribute has warning "This option is experimental and may change in future versions" so I'm not sure we should be using it in the first place.

Have you verified that setting "SET CHARACTER SET" in addition to "SET NAMES" for the MySQL DB is not needed?

What about "use utf8" Perl pragma, shouldn't that be enabled so that Perl deals with utf8 string correctly?
Attachment #272966 - Flags: review? → review-
(In reply to comment #13)
> (From update of attachment 272966 [details] [diff] [review])
> This breaks product names:
>  - Product with non-ASCII chars is shown incorrectly (non-ASCII chars are
> replaced with unknown char symbol).

  And you're running in UTF-8 mode and you ran recode.pl and your DB is encoded correctly?

>  - Product with non-latin chars: "undef error - Wide character in subroutine
> entry at .../HTML/Scrubber.pm line 322."

  Arrrgh. So perhaps HTML::Scrubber doesn't work right with utf-8 chars? It should, though...I thought we required a particular version specifically for that reason.

> Have you verified that setting "SET CHARACTER SET" in addition to "SET NAMES"
> for the MySQL DB is not needed?

  Yes.

> What about "use utf8" Perl pragma, shouldn't that be enabled so that Perl deals
> with utf8 string correctly?

  No, that was only required in Perl 5.6.
(In reply to comment #14)
>   And you're running in UTF-8 mode and you ran recode.pl and your DB is
> encoded correctly?

Yes, UTF8 mode is enabled. Data came from bugzilla-tip as is but I did run checksetup.pl that converted copied database to UTF8. Problems happened with new non-latin product I just created for this test. The non-ASCII product was already there. I should mention that without this patch, things I mentioned work again.
(In reply to comment #14)
> >  - Product with non-latin chars: "undef error - Wide character in subroutine
> > entry at .../HTML/Scrubber.pm line 322."
> 
>   Arrrgh. So perhaps HTML::Scrubber doesn't work right with utf-8 chars? It
> should, though...I thought we required a particular version specifically for
> that reason.

I'm not sure HTML::Scrubber is the culprit, by looking at its source code. We require HTML::Parser 3.40 or newer so that we can enable the utf8_mode parameter when running on Perl 5.8 or newer, which is always the case since Bugzilla 3.0.

Here is what HTML::Parser says about this utf8_mode parameter:

=item $p->utf8_mode( $bool )

Enable this option when parsing raw undecoded UTF-8.  This tells the
parser that the entities expanded for strings reported by C<attr>,
C<@attr> and C<dtext> should be expanded as decoded UTF-8 so they end
up compatible with the surrounding text.

If C<utf8_mode> is enabled then it is an error to pass strings
containing characters with code above 255 to the parse() method, and
the parse() method will croak if you try.

Example: The Unicode character "\x{2665}" is "\xE2\x99\xA5" when UTF-8
encoded.  The character can also be represented by the entity
"&hearts;" or "&#x2665".  If we feed the parser:

  $p->parse("\xE2\x99\xA5&hearts;");

then C<dtext> will be reported as "\xE2\x99\xA5\x{2665}" without
C<utf8_mode> enabled, but as "\xE2\x99\xA5\xE2\x99\xA5" when enabled.
The later string is what you want.

This option is only available with perl-5.8 or better.
Alias: bz-utf8
  Okay, apparently the problem is that STDOUT does not have "binmode STDOUT, ':utf8'" set appropriately. I don't know if this is something we should be expecting CGI to do for us, or if it's something we really ought to be doing, but it certainly seems to make things work right with this patch.
Attached patch v4 (obsolete) (deleted) — Splinter Review
Okay, this version fixes the display and searching of things with utf8 in them, but the administrative pages are still broken ("no such product"), so I have to look into that.
Attachment #272966 - Attachment is obsolete: true
Attached patch v5 (deleted) — Splinter Review
Okay, here we go! This version works in every area of Bugzilla that I could test.

  Here are the things that had to be fixed:

  1. The DB has to return utf-8 and accept utf-8 strings.
  2. The url_quote filter had to be fixed to correctly quote utf-8 strings. 
     Note that in Template-Toolkit 2.16 and above, the "uri" filter already
     does this for us, but we're not using that yet.
  3. CGI.pm had to be fixed to correctly return utf8 strings instead of bytes.
  4. We have to set "binmode STDOUT, ':utf8'" whenever we're going to output
     something. In this patch, we do it correctly in the CGI. checksetup.pl
     or other command-line scripts may need some other way to set it.
Attachment #286104 - Attachment is obsolete: true
Attachment #286357 - Flags: review?(wurblzap)
Attachment #286357 - Flags: review?(wicked)
(In reply to comment #19)
>   4. checksetup.pl
>      or other command-line scripts may need some other way to set it.

Do you mean bug 382398 comment 1?
(In reply to comment #20)
> Do you mean bug 382398 comment 1?

  Yeah, that's the right bug, thank you. :-) We'll fix that there, not here.
Blocks: 388723
Comment on attachment 286357 [details] [diff] [review]
v5

Great patch, both in terms how well it works and how it's done!

I neither checked nor looked at email_in.pl. Pg is out of scope for me as well.

>RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Constants.pm,v
>-                    version => '2.9003',
>-                    # Certain versions are broken, development versions are
>-                    # always disallowed.
>-                    blacklist => ['^3\.000[3-6]', '_'],
>+                    # For UTF-8 support
>+                    version => '4.00',

Don't we want to blacklist development versions any more? If this is a glitch, then it can be fixed on checkin.

>RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v
> sub url_quote {
>     my ($toencode) = (@_);
>+    utf8::encode($toencode) # The below regex works only on bytes
>+        if Bugzilla->params->{'utf8'} && utf8::is_utf8($toencode);
>     $toencode =~ s/([^a-zA-Z0-9_\-.])/uc sprintf("%%%02x",ord($1))/eg;
>     return $toencode;
> }

This turns Insídeṛ into Ins%C3%ADde%E1%B9%9B, which appears to work. It looks strange to me, though -- please reassure me that this is legal :)
Attachment #286357 - Flags: review?(wurblzap) → review+
(In reply to comment #22)
> Don't we want to blacklist development versions any more? If this is a glitch,
> then it can be fixed on checkin.

  Yeah, it's a glitch. I'll fix that on checkin.

> This turns Insídeṛ into Ins%C3%ADde%E1%B9%9B, which appears to work. It
> looks strange to me, though -- please reassure me that this is legal :)

  Hahaha. Yes, it is legal. :-) Those are the byte values, which we reassemble into utf-8 when they come back through CGI.pm.
Attachment #286357 - Flags: review?(wicked)
Flags: approval+
Checking in Bugzilla.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla.pm,v  <--  Bugzilla.pm
new revision: 1.62; previous revision: 1.61
done
Checking in email_in.pl;
/cvsroot/mozilla/webtools/bugzilla/email_in.pl,v  <--  email_in.pl
new revision: 1.10; previous revision: 1.9
done
Checking in Bugzilla/CGI.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/CGI.pm,v  <--  CGI.pm
new revision: 1.34; previous revision: 1.33
done
Checking in Bugzilla/Constants.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Constants.pm,v  <--  Constants.pm
new revision: 1.85; previous revision: 1.84
done
Checking in Bugzilla/Mailer.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Mailer.pm,v  <--  Mailer.pm
new revision: 1.13; previous revision: 1.12
done
Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v  <--  Util.pm
new revision: 1.63; previous revision: 1.62
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v  <--  Mysql.pm
new revision: 1.55; previous revision: 1.54
done
Checking in Bugzilla/DB/Pg.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Pg.pm,v  <--  Pg.pm
new revision: 1.25; previous revision: 1.24
done
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Blocks: 405350
Blocks: 405362
Blocks: 405404
(In reply to comment #11)
> Note that DBD::mysql 4.000 is not yet available in all distros, including on
> Windows. And not everybody agrees to use CPAN. So we shouldn't require such a
> high version as a minimum.

Yup, not available for RHEL either.  And that's one package nobody wants to screw with on production systems. :(

Guess nobody will be able to use Bugzilla 3.2 or newer until RHEL6.
(In reply to comment #25)
> Yup, not available for RHEL either.  And that's one package nobody wants to
> screw with on production systems. :(

  No, it's available from rpmforge as DBD-mysql (different capitalization).

> Guess nobody will be able to use Bugzilla 3.2 or newer until RHEL6.

  We use it fine on landfill.
Blocks: 406462
Blocks: 408446
Keywords: relnote
Blocks: 410902
Blocks: 400256
Blocks: 395505
Added to the release notes for Bugzilla 3.2 in a patch on bug 432331.
Keywords: relnote
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: