Closed Bug 304550 Opened 19 years ago Closed 18 years ago

Bugzilla should always store data in MySQL as UTF-8

Categories

(Bugzilla :: Database, defect, P2)

2.21

Tracking

()

RESOLVED FIXED
Bugzilla 3.0

People

(Reporter: yanzhenjiao, Assigned: mkanat)

References

()

Details

Attachments

(1 file, 3 obsolete files)

User-Agent:       Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
Build Identifier: 

i want to tanslate the bugzilla into chinese,but it cann't display chinese.
so i modify the following file:
C:\bugzilla\Bugzilla\DB\mysql.pm
in the new subfunction,i add this line
#!!! i add this line!!! 
  $self->do("set names gb2312");

the code is:
sub new {
    my ($class, $user, $pass, $host, $dbname, $port, $sock) = @_;

    # construct the DSN from the parameters we got
    my $dsn = "DBI:mysql:host=$host;database=$dbname";
    $dsn .= ";port=$port" if $port;
    $dsn .= ";mysql_socket=$sock" if $sock;
    
    my $self = $class->db_new($dsn, $user, $pass);
  #!!! i add this line!!! 
   $self->do("set names gb2312");
  
  # all class local variables stored in DBI derived class needs to have
    # a prefix 'private_'. See DBI documentation.
    $self->{private_bz_tables_locked} = "";

    bless ($self, $class);
    
    return $self;
}

Reproducible: Always
a dupe of bug 288346?
Not exactly. This bug has to do with a slightly larger problem that we may run
into once we enable UTF-8 support.

What the reporter has asked for is actually either INVALID or just a docs note,
but what we do actually need is the ability to possibly set the database to be
interpreted and stored as UTF-8 for new installations. So I'm going to take over
this bug for that purpose.

This has various extreme complications because of how perl deals with UTF-8, how
DBI deals (or doesn't deal) with UTF-8, and how the various database backends
deal (or don't deal) with UTF-8.

Somebody needs to investigate what the problems are if we *don't* set the
database to use UTF-8.
Severity: normal → enhancement
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows XP → All
Hardware: PC → All
Summary: it need to set database chatset to display chinese correctly. → Need to be able to set database charset (to UTF-8?)
Version: unspecified → 2.21
Oh, and if somebody does test, make sure that you also test Unicode characters
with a Unicode value greater than 10,000 -- I've had problems with those before,
on DBD::Pg, at least.
This requires a reinitdb for postgres, doesn't it? Or at least recreating the db.
(In reply to comment #4)
> This requires a reinitdb for postgres, doesn't it? Or at least recreating the 
> db.

  Yeah. As far as I know it would require a pg_dump and then a re-import of the
DB for Pg.
my bug report is a suggest for bugzilla's function.
now,bugzilla is widely used all around the world.
so,the multi-language support should be appreciated.there are some people
who have tanslate the bugzilla from english to the other languages,but,
when Bugzilla  connecting the database,it hasn't deal with the charset.
if this feature is added,we can use it with our own language.

 
Assignee: database → mkanat
Summary: Need to be able to set database charset (to UTF-8?) → Bugzilla should use UTF-8 to connect to the database when the utf8 parameter is on
Target Milestone: --- → Bugzilla 2.24
*** Bug 328013 has been marked as a duplicate of this bug. ***
*** Bug 331583 has been marked as a duplicate of this bug. ***
This is a bug now, because bug 331583 reports that it causes data to get encoded twice.
Severity: enhancement → major
Priority: -- → P2
Target Milestone: Bugzilla 2.24 → Bugzilla 2.22
Okay, for MySQL I think what we need to do is SET CHARACTER SET 'utf8' if the UTF-8 param is on. That way it will correctly convert between our input UTF-8 and whatever encoding the database is. If the database is also UTF-8, it still works.
If you do it, you also need add comments in rel_notes for users of 2.22rc1. Because after upgrading users can't access non ASCII characters in Bugzilla.
They must repair bugs database.
I make it by next commands:

mysqldump.exe --default-character-set=latin1 -ubugs -p<yourpass> bugs >bugs.sql
!!! need manual correction of bugs.sql
mysql --default-character-set=utf8 -ubugs -p<yourpass> bugs <bugs.sql

Unfortunately current current version of mysqldump has bug - "mysqldump does not dumps tinytext and mediumtext default" - (we reported it as http://bugs.mysql.com/?id=18482), but after manual correction of 'bugs' table definition ('status_whiteboard', 'keywords' columns) and  'groups' table definition ('userregexp' column) this work right 

Blocks: 332190
*** Bug 344342 has been marked as a duplicate of this bug. ***
Max, are you working on this?
Apparently not. Taking for now; I feel that I may need some help soon, though.
Assignee: mkanat → wurblzap
Okay, I'm going to take this, now, and actually work on it. I suspect we'll only be able to fix it in 3.0, because we'll have to convert the whole database's character set to UTF-8. However, I'm not sure this is even possible in MySQL 4.0.

As such, in order to actually repair this bug, we will have to require MySQL 4.1. It's a significant-enough bug that all users should have it fixed. However, fixing it programatically in checksetup.pl would be impossible to do without MySQL 4.1.

See also:

http://dev.mysql.com/doc/refman/4.1/en/charset.html

FWIW, my goal is to eventually be able to eliminate the "utf8" parameter and have Bugzilla always use UTF-8, and this would be a critical part of it. (The other part is converting old databases, but I have that pretty well in-hand in bug 280633.)

I'm pretty sure we can fix this bug safely without that, though. We just will have to require MySQL 4.1.
Assignee: wurblzap → mkanat
Status: NEW → ASSIGNED
Summary: Bugzilla should use UTF-8 to connect to the database when the utf8 parameter is on → Bugzilla should always store data in MySQL as UTF-8
Target Milestone: Bugzilla 2.22 → Bugzilla 3.0
I think this is a blocker, because our UTF-8 support just can't really work without this patch. The reason that I'm nominating it instead of granting it directly is that I need justdave's approval to bump the MySQL version requirement.

I think it's entirely reasonable to bump the requirement. RHEL4 has MySQL 4.1, and so have most other distros for years.
Flags: blocking3.0?
Attached patch v1 (obsolete) (deleted) β€” β€” Splinter Review
Okay, I can vouch for this patch. It should work just fine, since all it's doing is telling MySQL that it should collate things as though they were UTF-8. It doesn't actually do any real character set *conversion*, it just changes the storage format in the MySQL backend to UTF-8.

Still, I'd like some testing of this from somebody who's more familiar with localization than I am. I want to know more about how it behaves with the UTF-8 parameter both on and off.
Attachment #239881 - Flags: review?
Attachment #239881 - Flags: review? → review?(wurblzap)
Attached patch v2 (obsolete) (deleted) β€” β€” Splinter Review
Okay, the previous version was destroying multi-byte encodings, but this version works fine.

HOWEVER, we seem to have a problem with MySQL *destroying* ISO-8859-X data, where X != 1. It absolutely shouldn't be doing this, but it is. So I'm looking into it.

I tested it with ISO-8859-7 (Greek) and it ate the characters.
Attachment #239881 - Attachment is obsolete: true
Attachment #240066 - Flags: review?(wurblzap)
Attachment #239881 - Flags: review?(wurblzap)
I've reported my problem to MySQL as: http://bugs.mysql.com/22719

We'll see what they say.
Okay, as far as I can tell, MySQL's response is "MySQL is supposed to eat your data that way."

So, we have a problem:

If people have invalid UTF-8 in their database, it may be physically deleted. :-(

However, without this patch, MySQL definitely can't correctly sort non-English data. It may not even be able to store UTF-8 data correctly, always.

We do have recode.pl going in, from bug 280633.

So, my suggestion is this:

1) We do this, but at this point in checksetup we put a BIG WARNING that you
   must now run recode.pl before you continue. Because this is 3.0, I think
   it would be acceptable to require this.

2) After people have run recode.pl (We can perhaps keep track of it by a field
   in bz_schema), then they can run checksetup again to upgrade their tables.

Of course, I need justdave's approval for this. :-)
Okay, I have a better solution:

1) If the installation has the utf-8 flag on (meaning its new or its all in utf-8), we do the table conversion, and permanently set a column in bz_schema to note that the DB is utf-8. However, before we do the conversion we print a big warning that it will destroy data and that people should run recode.pl if they ever had any non-utf8 data in their DB.

2) Old DBs continue running just the same until they turn on utf8.

3) When people turn on the utf8 param, we note that they should run checksetup.pl to convert their database.
Note that the only problem with the above solution is that we have to maintain the utf8-conversion code, and it will have to work on all future versions of Bugzilla. Right now, it only has to work once, during the 3.0 upgrade.

But the above solution is safer. Perhaps we can do it this way for 3.0, and then for 3.2 we can make it mandatory, and switch on the utf8 parameter always.
Attached patch v3 (obsolete) (deleted) β€” β€” Splinter Review
Okay, this version implements what I described above. I'm signing off, myself, on the DB part of the code, so you don't have to review that at all.

However, this patch also touches editparams.cgi and some params-related code, so I need review for that. LpSolit, could you just look at that part of the code? I've tested it all.
Attachment #240066 - Attachment is obsolete: true
Attachment #241154 - Flags: review?(LpSolit)
Attachment #240066 - Flags: review?(wurblzap)
Oh, and justdave allowed in our last meeting to bump the requirement to 4.1, so that makes this OK to be a blocker.
Depends on: bz-recode
Flags: blocking3.0? → blocking3.0+
Comment on attachment 241154 [details] [diff] [review]
v3

>Index: Bugzilla/Config/Common.pm

>+sub check_utf8 {

>+        return "You cannot disable UTF-8 support, because your MySQL database"
>+               . " is encoded in UTF-8.";

Don't write the final period. There is already one automatically added. Funny, isn't it? :)



>Index: Bugzilla/DB/Schema.pm

>+sub convert_type {

>+=cut
>+    my ($self, $type) = @_;

You *must* add a newline after =cut, else 011pod.t throws an error.



>Index: template/en/default/admin/params/core.html.tmpl

>           "problems. Existing databases should set this to true only after " _
>           "the data has been converted from existing legacy character " _
>-          "encodings to UTF-8.",

Please make this string UTLRA-BOLD! I just tried turning this parameter on and running checksetup.pl again, and all my ISO-8859-1 data has been deleted. :( So please make it clear here that non-UTF-8 will be deleted.


I was going to r+ it, with my comments above fixed, but justdave just told me on IRC that the correct behavior was to reject the conversion and force the user to run recode.pl first. So that's the reason of my r-.


<justdave> mkanat was talking about that the other day...
<justdave> I thought we had decided to force users to run the recode script first or something
<justdave> in other words, yes, it's supposed to reject the conversion if there's non-utf data still present, and tell you to run recode.pl
<justdave> perhaps he hasn't redone the patch yet since we had that conversation
Attachment #241154 - Flags: review?(LpSolit) → review-
(In reply to comment #25)
> <justdave> in other words, yes, it's supposed to reject the conversion if
> there's non-utf data still present, and tell you to run recode.pl
> <justdave> perhaps he hasn't redone the patch yet since we had that
> conversation

  No, I have redone the patch. It's just that now it only does the conversion if you've turned on the "utf8" parameter.

  I can't tell if there's any ISO-8859-X data in the database without going through all of the database in a loop.

  I don't want to *always* stop checksetup.pl at this point.

  I just assume that if somebody has the "utf8" parameter on, their data is in utf8.

  checksetup printed out a large warning to you that it would delete things, when you ran it, and it told you to run recode.pl. Did you see the warning?
Attached patch v4 (deleted) β€” β€” Splinter Review
Okay, here's a version that fixes your comments.

I agree that it's a dangerous thing for checksetup.pl to do, but because it's 3.0 I think it's reasonable to expect people to have to do something manually in the middle of the installation.

We should definitely mention this in the release notes, though.
Attachment #241154 - Attachment is obsolete: true
Attachment #241368 - Flags: review?(LpSolit)
Keywords: relnote
Whiteboard: [relnote comment 27]
Comment on attachment 241368 [details] [diff] [review]
v4

r=LpSolit
Attachment #241368 - Flags: review?(LpSolit) → review+
Flags: approval?
*** Bug 355994 has been marked as a duplicate of this bug. ***
*** Bug 355995 has been marked as a duplicate of this bug. ***
Flags: approval? → approval+
leaving this in the approval queue so I continue to see it, but I'm going to sit on this until I play with the recode thing (this week, I promise).  I think it'll be a VERY good idea to make people run recode before running this, so recode needs to be a prerequisite (hmm, dependency is already set - let's wait for it)
Flags: approval+ → approval?
*** Bug 330430 has been marked as a duplicate of this bug. ***
Flags: approval? → approval+
Checking in Bugzilla/Constants.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Constants.pm,v  <--  Constants.pm
new revision: 1.62; previous revision: 1.61
done
Checking in Bugzilla/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB.pm,v  <--  DB.pm
new revision: 1.91; previous revision: 1.90
done
Checking in Bugzilla/Config/Common.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Config/Common.pm,v  <--  Common.pm
new revision: 1.13; previous revision: 1.12
done
Checking in Bugzilla/Config/Core.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Config/Core.pm,v  <--  Core.pm
new revision: 1.7; previous revision: 1.6
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v  <--  Mysql.pm
new revision: 1.45; previous revision: 1.44
done
Checking in Bugzilla/DB/Schema.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Schema.pm,v  <--  Schema.pm
new revision: 1.76; previous revision: 1.75
done
Checking in Bugzilla/DB/Schema/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Schema/Mysql.pm,v  <--  Mysql.pm
new revision: 1.14; previous revision: 1.13
done
Checking in template/en/default/admin/params/core.html.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/admin/params/core.html.tmpl,v  <--  core.html.tmpl
new revision: 1.5; previous revision: 1.4
done
Checking in template/en/default/global/messages.html.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/global/messages.html.tmpl,v  <--  messages.html.tmpl
new revision: 1.47; previous revision: 1.46
done
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Blocks: 330599
Added to the release notes on bug 349423.
Keywords: relnote
Whiteboard: [relnote comment 27]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: