363153 - (bz-utf8) Prepare Bugzilla so that string functions like substr() work with UTF-8 characters

Reporter

Description

•

18 years ago

String functions like substr() seem to work on bytes instead of on characters. This doesn't matter as long as characters are one byte each. Non-English languages frequently use multiple-byte characters, though, and these occasionally get cut in half (for example in abbreviated columns in bug lists), resulting in dummy replacement characters in display.

Strings need to have Perl's Unicode flag set so that functions like substr() work on characters instead of on bytes. This means that all sources Bugzilla gets its strings from (database, environment (?), CGI, templates, code constants) need to do this. For databases, http://aspn.activestate.com/ASPN/docs/ActivePerl/5.8/lib/Pod/perlunicode.html implies that stuff needs to be done to achieve this (in the "A wrapper for fetchrow_array and fetchrow_hashref" section). Other sources may do it by default.

Max Kanat-Alexander

Assignee

Comment 1

•

18 years ago

I'd say this is a blocker.

Flags: blocking3.0+

Max Kanat-Alexander

Assignee

Comment 2

•

18 years ago

Okay, I looked into it more, and this could have some far-reaching implications.

It's actually likely to work just fine, but the problem is that some perl modules may not understand utf8, and may "downgrade" any particular string to be not utf8. Because I don't know which modules will do this, I don't want to make this change while we're frozen.

If there are modules that do this, it could cause strange bugs that take forever to track down, and are hard to fix. I don't want to have to extensively change our module requirements while we're frozen.

So, we'll just have to live with this in 3.0 and wait for 3.2 to fix it.

Flags: blocking3.0+ → blocking3.0-

Priority: -- → P1

Target Milestone: Bugzilla 3.0 → Bugzilla 3.2

Max Kanat-Alexander

Assignee

Updated

•

18 years ago

Assignee: general → mkanat

Max Kanat-Alexander

Assignee

Comment 3

•

18 years ago

Okay, for reference, this can be enabled in DBD::mysql with:

$dbh->{mysql_enable_utf8} = 1

And in DBD::Pg with:

$dbh->{pg_enable_utf8} = 1

(We don't need to do it in Pg yet, though, since I don't think we store things there as UTF-8, yet.)

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Severity: normal → enhancement

Component: Bugzilla-General → Database

Max Kanat-Alexander

Assignee

Comment 7

•

17 years ago

Attached patch v1 (obsolete) (deleted) — Details — Splinter Review

Okay, this is a very simple patch, but it could have far-reaching effects, as I mentioned above.

We need to test Bugzilla everywhere with multi-byte data, with this patch, particularly places that process strings or pass strings from the DB to CPAN modules.

I have a test installation (on MySQL) with this patch applied, here:

http://landfill.bugzilla.org/utf8/

We also should test it on PgSQL at some point.

Attachment #272961 - Flags: review?

Max Kanat-Alexander

Assignee

Comment 8

•

17 years ago

Attached patch v2 (obsolete) (deleted) — Details — Splinter Review

Fixed the Pg part. No other changes.

Attachment #272961 - Attachment is obsolete: true

Attachment #272963 - Flags: review?

Attachment #272961 - Flags: review?

Max Kanat-Alexander

Assignee

Comment 9

•

17 years ago

Comment on attachment 272963 [details] [diff] [review]
v2

Something is wrong here, because I can pull a multibyte summary from the DB, but it doesn't have the utf8 flag on.

Attachment #272963 - Flags: review? → review-

Max Kanat-Alexander

Assignee

Comment 10

•

17 years ago

Attached patch v3 (obsolete) (deleted) — Details — Splinter Review

Okay, so the problem was that mysql_enable_utf8 actually requires DBD-mysql 4.0.

Attachment #272963 - Attachment is obsolete: true

Attachment #272966 - Flags: review?

Frédéric Buclin

Comment 11

•

17 years ago

Note that DBD::mysql 4.000 is not yet available in all distros, including on Windows. And not everybody agrees to use CPAN. So we shouldn't require such a high version as a minimum.

Teemu Mannermaa (:wicked)

Comment 12

•

17 years ago

Comment on attachment 272966 [details] [diff] [review]
v3

>Index: Bugzilla/Constants.pm
>===================================================================
>-                    # Certain versions are broken, development versions are
>-                    # always disallowed.
>-                    blacklist => ['^3\.000[3-6]', '_'],

You shouldn't remove development versions from the blacklist so leave the '_' alone.

Teemu Mannermaa (:wicked)

Comment 13

•

17 years ago

Comment on attachment 272966 [details] [diff] [review]
v3

This breaks product names:
 - Product with non-ASCII chars is shown incorrectly (non-ASCII chars are replaced with unknown char symbol).
 - Product with non-latin chars: "undef error - Wide character in subroutine entry at .../HTML/Scrubber.pm line 322."
 - Any link or function that gets a non-latin product name ends up not finding the product.

Adding comment with non-latin characters throws an error: "undef error - Wide character in subroutine entry at .../Email/MIME/Encodings.pm line 26." but does add the comment.

So this patch clearly makes things worse instead of fixing anything.

Besides, this attribute has warning "This option is experimental and may change in future versions" so I'm not sure we should be using it in the first place.

Have you verified that setting "SET CHARACTER SET" in addition to "SET NAMES" for the MySQL DB is not needed?

What about "use utf8" Perl pragma, shouldn't that be enabled so that Perl deals with utf8 string correctly?

Attachment #272966 - Flags: review? → review-

Max Kanat-Alexander

Assignee

Comment 14

•

17 years ago

(In reply to comment #13)
> (From update of attachment 272966 [details] [diff] [review])
> This breaks product names:
>  - Product with non-ASCII chars is shown incorrectly (non-ASCII chars are
> replaced with unknown char symbol).

  And you're running in UTF-8 mode and you ran recode.pl and your DB is encoded correctly?

>  - Product with non-latin chars: "undef error - Wide character in subroutine
> entry at .../HTML/Scrubber.pm line 322."

  Arrrgh. So perhaps HTML::Scrubber doesn't work right with utf-8 chars? It should, though...I thought we required a particular version specifically for that reason.

> Have you verified that setting "SET CHARACTER SET" in addition to "SET NAMES"
> for the MySQL DB is not needed?

  Yes.

> What about "use utf8" Perl pragma, shouldn't that be enabled so that Perl deals
> with utf8 string correctly?

  No, that was only required in Perl 5.6.

Teemu Mannermaa (:wicked)

Comment 15

•

17 years ago

(In reply to comment #14)
>   And you're running in UTF-8 mode and you ran recode.pl and your DB is
> encoded correctly?

Yes, UTF8 mode is enabled. Data came from bugzilla-tip as is but I did run checksetup.pl that converted copied database to UTF8. Problems happened with new non-latin product I just created for this test. The non-ASCII product was already there. I should mention that without this patch, things I mentioned work again.

Frédéric Buclin

Comment 16

•

17 years ago

(In reply to comment #14)
> >  - Product with non-latin chars: "undef error - Wide character in subroutine
> > entry at .../HTML/Scrubber.pm line 322."
> 
>   Arrrgh. So perhaps HTML::Scrubber doesn't work right with utf-8 chars? It
> should, though...I thought we required a particular version specifically for
> that reason.

I'm not sure HTML::Scrubber is the culprit, by looking at its source code. We require HTML::Parser 3.40 or newer so that we can enable the utf8_mode parameter when running on Perl 5.8 or newer, which is always the case since Bugzilla 3.0.

Here is what HTML::Parser says about this utf8_mode parameter:

=item $p->utf8_mode( $bool )

Enable this option when parsing raw undecoded UTF-8.  This tells the
parser that the entities expanded for strings reported by C<attr>,
C<@attr> and C<dtext> should be expanded as decoded UTF-8 so they end
up compatible with the surrounding text.

If C<utf8_mode> is enabled then it is an error to pass strings
containing characters with code above 255 to the parse() method, and
the parse() method will croak if you try.

Example: The Unicode character "\x{2665}" is "\xE2\x99\xA5" when UTF-8
encoded.  The character can also be represented by the entity
"&hearts;" or "&#x2665".  If we feed the parser:

  $p->parse("\xE2\x99\xA5&hearts;");

then C<dtext> will be reported as "\xE2\x99\xA5\x{2665}" without
C<utf8_mode> enabled, but as "\xE2\x99\xA5\xE2\x99\xA5" when enabled.
The later string is what you want.

This option is only available with perl-5.8 or better.

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Alias: bz-utf8

Max Kanat-Alexander

Assignee

Comment 17

•

17 years ago

  Okay, apparently the problem is that STDOUT does not have "binmode STDOUT, ':utf8'" set appropriately. I don't know if this is something we should be expecting CGI to do for us, or if it's something we really ought to be doing, but it certainly seems to make things work right with this patch.

Max Kanat-Alexander

Assignee

Comment 18

•

17 years ago

Attached patch v4 (obsolete) (deleted) — Details — Splinter Review

Okay, this version fixes the display and searching of things with utf8 in them, but the administrative pages are still broken ("no such product"), so I have to look into that.

Attachment #272966 - Attachment is obsolete: true

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

URL: http://landfill.bugzilla.org/utf8/

Max Kanat-Alexander

Assignee

Comment 19

•

17 years ago

Attached patch v5 (deleted) — Details — Splinter Review

Okay, here we go! This version works in every area of Bugzilla that I could test.

  Here are the things that had to be fixed:

  1. The DB has to return utf-8 and accept utf-8 strings.
  2. The url_quote filter had to be fixed to correctly quote utf-8 strings. 
     Note that in Template-Toolkit 2.16 and above, the "uri" filter already
     does this for us, but we're not using that yet.
  3. CGI.pm had to be fixed to correctly return utf8 strings instead of bytes.
  4. We have to set "binmode STDOUT, ':utf8'" whenever we're going to output
     something. In this patch, we do it correctly in the CGI. checksetup.pl
     or other command-line scripts may need some other way to set it.

Attachment #286104 - Attachment is obsolete: true

Attachment #286357 - Flags: review?(wurblzap)

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Attachment #286357 - Flags: review?(wicked)

Vitaly Fedrushkov

Comment 20

•

17 years ago

(In reply to comment #19)
>   4. checksetup.pl
>      or other command-line scripts may need some other way to set it.

Do you mean bug 382398 comment 1?

Max Kanat-Alexander

Assignee

Comment 21

•

17 years ago

(In reply to comment #20)
> Do you mean bug 382398 comment 1?

  Yeah, that's the right bug, thank you. :-) We'll fix that there, not here.

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Blocks: 388723

Marc Schumann [:Wurblzap]

Reporter

Comment 22

•

17 years ago

Comment on attachment 286357 [details] [diff] [review]
v5

Great patch, both in terms how well it works and how it's done!

I neither checked nor looked at email_in.pl. Pg is out of scope for me as well.

>RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Constants.pm,v
>-                    version => '2.9003',
>-                    # Certain versions are broken, development versions are
>-                    # always disallowed.
>-                    blacklist => ['^3\.000[3-6]', '_'],
>+                    # For UTF-8 support
>+                    version => '4.00',

Don't we want to blacklist development versions any more? If this is a glitch, then it can be fixed on checkin.

>RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v
> sub url_quote {
>     my ($toencode) = (@_);
>+    utf8::encode($toencode) # The below regex works only on bytes
>+        if Bugzilla->params->{'utf8'} && utf8::is_utf8($toencode);
>     $toencode =~ s/([^a-zA-Z0-9_\-.])/uc sprintf("%%%02x",ord($1))/eg;
>     return $toencode;
> }

This turns Insídeṛ into Ins%C3%ADde%E1%B9%9B, which appears to work. It looks strange to me, though -- please reassure me that this is legal :)

Attachment #286357 - Flags: review?(wurblzap) → review+

Max Kanat-Alexander

Assignee

Comment 23

•

17 years ago

(In reply to comment #22)
> Don't we want to blacklist development versions any more? If this is a glitch,
> then it can be fixed on checkin.

  Yeah, it's a glitch. I'll fix that on checkin.

> This turns Insídeṛ into Ins%C3%ADde%E1%B9%9B, which appears to work. It
> looks strange to me, though -- please reassure me that this is legal :)

  Hahaha. Yes, it is legal. :-) Those are the byte values, which we reassemble into utf-8 when they come back through CGI.pm.

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Attachment #286357 - Flags: review?(wicked)

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Flags: approval+

Max Kanat-Alexander

Assignee

Comment 24

•

17 years ago

Checking in Bugzilla.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla.pm,v  <--  Bugzilla.pm
new revision: 1.62; previous revision: 1.61
done
Checking in email_in.pl;
/cvsroot/mozilla/webtools/bugzilla/email_in.pl,v  <--  email_in.pl
new revision: 1.10; previous revision: 1.9
done
Checking in Bugzilla/CGI.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/CGI.pm,v  <--  CGI.pm
new revision: 1.34; previous revision: 1.33
done
Checking in Bugzilla/Constants.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Constants.pm,v  <--  Constants.pm
new revision: 1.85; previous revision: 1.84
done
Checking in Bugzilla/Mailer.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Mailer.pm,v  <--  Mailer.pm
new revision: 1.13; previous revision: 1.12
done
Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v  <--  Util.pm
new revision: 1.63; previous revision: 1.62
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v  <--  Mysql.pm
new revision: 1.55; previous revision: 1.54
done
Checking in Bugzilla/DB/Pg.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Pg.pm,v  <--  Pg.pm
new revision: 1.25; previous revision: 1.24
done

Status: NEW → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Blocks: 405350

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Blocks: 405362

Frédéric Buclin

Updated

•

17 years ago

Blocks: 405404

Dave Miller [:justdave]

Comment 25

•

17 years ago

(In reply to comment #11)
> Note that DBD::mysql 4.000 is not yet available in all distros, including on
> Windows. And not everybody agrees to use CPAN. So we shouldn't require such a
> high version as a minimum.

Yup, not available for RHEL either.  And that's one package nobody wants to screw with on production systems. :(

Guess nobody will be able to use Bugzilla 3.2 or newer until RHEL6.

Max Kanat-Alexander

Assignee

Comment 26

•

17 years ago

(In reply to comment #25)
> Yup, not available for RHEL either.  And that's one package nobody wants to
> screw with on production systems. :(

  No, it's available from rpmforge as DBD-mysql (different capitalization).

> Guess nobody will be able to use Bugzilla 3.2 or newer until RHEL6.

  We use it fine on landfill.

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Blocks: 406462

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Blocks: 408446

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Keywords: relnote

Frédéric Buclin

Updated

•

17 years ago

Blocks: 410902

Marc Schumann [:Wurblzap]

Reporter

Updated

•

17 years ago

Blocks: 400256

Max Kanat-Alexander

Assignee

Updated

•

17 years ago

Blocks: 395505

Max Kanat-Alexander

Assignee

Comment 31

•

16 years ago

Added to the release notes for Bugzilla 3.2 in a patch on bug 432331.

Keywords: relnote

v1 17 years ago Max Kanat-Alexander (deleted), patch		Details \| Diff \| Splinter Review
v2 17 years ago Max Kanat-Alexander (deleted), patch	mkanat : review-	Details \| Diff \| Splinter Review
v3 17 years ago Max Kanat-Alexander (deleted), patch	wicked : review-	Details \| Diff \| Splinter Review
v4 17 years ago Max Kanat-Alexander (deleted), patch		Details \| Diff \| Splinter Review
v5 17 years ago Max Kanat-Alexander (deleted), patch	Wurblzap : review+	Details \| Diff \| Splinter Review