Closed Bug 126266 (bz-charset) Opened 23 years ago Closed 19 years ago

Use UTF-8 (Unicode) charset encoding for pages and email for NEW installations

Categories

(Bugzilla :: Bugzilla-General, defect, P1)

2.15

Tracking


RESOLVED FIXED
Bugzilla 2.22

People

(Reporter: burnus, Assigned: glob)

References

Details

(Whiteboard: i18n)

Attachments

(2 files, 30 obsolete files)

(deleted), patch
Wurblzap
: review+
Details | Diff | Splinter Review
(deleted), patch
cso
: review+
Details | Diff | Splinter Review
Presently the Bugzilla web pages don't contain an encoding header. Neither do the emails.

Expected:
- The HTML pages come with an encoding header such as:
  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
- The emails come with encoding headers such as:
  MIME-version: 1.0
  Content-type: text/plain; format=flowed; charset=ISO-8859-1
  Content-transfer-encoding: 8BIT

Reasoning: the encoding information makes sure that 8-bit characters are shown correctly. I have chosen ISO-8859-1 (Latin-1) since it is the most widespread (though not as "good" as UTF-8) and is the default encoding of MySQL.
The patch changes defparams.pl, so only new installations of Bugzilla are affected. Additionally, this value can easily be changed on the "parameters" page.
Some small case changes should be made, so that the output looks like this:

  MIME-Version: 1.0
  Content-Type: text/plain; format=flowed; charset=iso-8859-1
  Content-Transfer-Encoding: 8bit

instead of:

  MIME-version: 1.0
  Content-type: text/plain; format=flowed; charset=ISO-8859-1
  Content-transfer-encoding: 8BIT
Changed "ISO" to "iso" and "8BIT" to "8bit". Could someone mark attachement 70138, I'm not allowed to do so
"MIME-version: 1.0" should be "MIME-Version: 1.0"
"Content-type" should be "Content-Type" "Content-transfer-encoding" should be "Content-Transfer-Encoding"
> "Content-type" should be "Content-Type" > "Content-transfer-encoding" should be "Content-Transfer-Encoding" Fixed. I should really go to bed ... ("Obsolet" marking is bug 97729 by the way)
Keywords: patch, review
Comment on attachment 70148 [details] [diff] [review]
defparams.pl patch (v3): Send per default the encoding for HTML (header) and for emails

Can't use a META tag for content-type. It's broken and makes Bad Things happen in Netscape 4.x. Need to actually send a charset parameter on the Content-Type header being spit out in the HTTP.

Please see the discussion on bug 38856 for why this was refused entry the last time it was presented, and take into consideration anything from that bug that we need to do to keep everyone happy.
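To illustrate the distinction in the review above — charset in the HTTP Content-Type header rather than in a <meta> tag — here is a minimal sketch. It is written in Python for brevity (Bugzilla itself is Perl), and the function name is purely illustrative.

```python
def content_type_header(encoding=None):
    """Build the HTTP Content-Type header line for an HTML response.

    The charset travels in the HTTP header itself, not in a <meta>
    http-equiv tag, which Netscape 4.x mishandles.
    """
    if encoding:
        return "Content-Type: text/html; charset=%s" % encoding
    return "Content-Type: text/html"

# A CGI script would print this line, a blank line, then the HTML body:
#   print(content_type_header("iso-8859-1") + "\r\n\r\n" + html_body)
```

Leaving the encoding unset falls back to a bare Content-Type, preserving the browser's charset autodetection.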
Attachment #70148 - Flags: review-
Attachment #70138 - Attachment is obsolete: true
Attachment #70144 - Attachment is obsolete: true
Attachment #70144 - Attachment is patch: true
Keywords: patch, review
Attached patch Bigger patch for text/html (v4) (obsolete) (deleted) — Splinter Review
This patch addresses the problems by replacing the print "Content-Type: text/html\n\n" with a function. This patch was _not_ thoroughly tested; the %...% part in the email settings is untested (to come...). Additionally, the documentation (3.5.5) needs to be updated if this is checked in.
Attached patch Bigger patch for text/html (v5) (obsolete) (deleted) — Splinter Review
Now tested. Changes to the previous version: HTMLencoding -> encoding (since it is used by mail and needed for the %encoding% substitution); %encoding% substitution works now.
Attached patch Bigger patch for text/html (v5) (obsolete) (deleted) — Splinter Review
Now tested. Changes to the previous version:
- HTMLencoding -> encoding (since it is used by mail and needed for the %encoding% substitution)
- %encoding% substitution works now (in the email params)
Keywords: patch, review
--- reports.cgi 2002/01/31 23:51:38 1.51
+++ reports.cgi 2002/02/19 11:17:36
+ PutHTMLContentType("Content-disposition: inline; filename=bugzilla_report.html";

I missed a ")" before the semicolon.
Question: what does it do if you leave it blank? (I haven't looked at the patch yet, but it's easier to ask you than dig through the patch :) Does it do the Content-Type: text/plain (or text/html) without the "; charset=xxxx" on the end if you leave it blank? (It'll need to work this way for the Japanese folks, IIRC, since they have to be able to change the charset on the fly in the middle of the page.)
> Question: what does it do if you leave it blank? (I haven't looked at the
> patch yet, but easier to ask you than dig through the patch :)
> Does it do the Content-Type: text/plain (or text/html) without the ;
> charset=xxxx on the end if you leave it blank?

It then sends only the text/html part (text/plain is not supported (yet?)).

+ if( Param('encoding') ne '') {
+     print 'Content-Type: text/html; charset='.Param('encoding')."\n$header\n";
+ } else {
+     print "Content-Type: text/html\n$header\n";
+ }

> (it'll need to work this way for the japanese folks IIRC since they have to be
> able to change the charset on the fly in the middle of the page)

Hm. This doesn't sound that healthy actually, but if the browser still likes it ...
Attached patch Bigger patch for text/html (v6) (obsolete) (deleted) — Splinter Review
Fixes missing ')' and re-diff after long_list.cgi and describekeywords.cgi have been templateized.
> It sends then only the text/html part (text/plain is not supported (yet?)). text/plain is how the email sends, correct? :-)
> > It sends then only the text/html part (text/plain is not supported (yet?)).
> text/plain is how the email sends, correct? :-)

True. I missed this since it is only used in defparams.pl and thus easily customizable. In this sense it is not true that only "Content-Type: text/plain" appears if "encoding" is empty. I'm rather comfortable having a

  Content-Type: text/plain; format=flowed; charset=%encoding%

in the params' *mail settings, but I can also move this part (MIME, 8bit, Content-Type) to another Perl function if that is desired.
Attached patch Bigger patch for text/html (v7) (obsolete) (deleted) — Splinter Review
Minor changes to PutHTMLContentType (I confess I forgot to initialise a variable using ''; plus: call Param('encoding') only once).
Attached patch Bigger patch for text/html (v8) (obsolete) (deleted) — Splinter Review
Rediff after relogin.cgi and defparams.pl had been changed.
I think this would go well with bug 126456 (2.16 blocker/"Fix our error handling"). Regarding that: would it make sense to set $vars->{'header_done'} in PutHTMLContentType once bug 126456 is checked in, or is this the wrong place to do so?
Severity: normal → major
No, vars->{'header_done'} should be set only after the global/header template has been printed. I'm still not convinced about the way you are doing things in this bug, but I need more time to look at it to work out why. ;-) Gerv
--- post_bug.cgi 2002/02/05 00:20:08 1.39
+++ post_bug.cgi 2002/02/24 15:43:09
-print "Content-type: text/html\n\n";
+pPutHTMLContentType();

s/pP/P/

> I'm still not convinced about the way you are doing things in this bug, but I
> need more time to look at it to work out why. ;-)

Hmm. I thought it wasn't that bad ;-) As long as you can come up with something else which sends the email and the HTML pages with the right charset, I'm fine with that.
Comment on attachment 71202 [details] [diff] [review]
Bigger patch for text/html (v8)

This is the way I think it should work. The function should be called SendHTTPHeader(), and take an array of strings, including a Content-Type. It should print them all, in the order given, with \n separating them, but if it spots an HTML Content-Type, it should slyly insert the charset into it. It prints \n\n at the end.

This seems to me to be a much cleaner interface; it works for different content-types too, and is extensible.

Gerv
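The interface proposed in the review above can be sketched as follows. This is an illustrative Python translation of the suggested SendHTTPHeader() behaviour, not Bugzilla's actual Perl implementation; the fallback to a bare text/html header when no encoding is configured matches the behaviour discussed earlier in the thread.

```python
def send_http_header(lines, encoding=""):
    """Sketch of the proposed SendHTTPHeader() interface.

    Takes a list of header strings, prints them in order separated by
    \n, slips the configured charset into any HTML Content-Type it
    spots, and ends with \n\n.  With no arguments, an HTML
    Content-Type is assumed.
    """
    if not lines:
        lines = ["Content-Type: text/html"]
    out = []
    for line in lines:
        if (encoding
                and line.lower().startswith("content-type:")
                and "text/html" in line
                and "charset" not in line):
            line += "; charset=" + encoding
        out.append(line)
    return "\n".join(out) + "\n\n"
```

Extra headers such as Content-Disposition pass through untouched, which is what makes the interface extensible.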
Attachment #71202 - Flags: review-
Attached patch Bigger patch for text/html (v9) (obsolete) (deleted) — Splinter Review
Fix pPut... error and rediff after xml.cgi and userprefs.cgi got checked in.
*** Bug 128609 has been marked as a duplicate of this bug. ***
altering the summary of this bug to more closely match what the patch is actually accomplishing.
Summary: Bugzilla should send encoding ISO-8859-1 per default → Allow administrator to set charset encoding for pages and email
Attached patch v10: Encoding patch for mail and html (obsolete) (deleted) — Splinter Review
-needs work-

Goal:
- Provide a new function "PutHTMLContentHeader" which sends the content-transfer-encoding for HTML
- Provide the option of sending emails with content-transfer-encoding 8bit or quoted-printable (this can be set via editparams.cgi)

Done:
- Options are in defparams.pl
- PutHTMLContentType is used
- The default email setting uses MIME with %encoding% and %transportencoding%
- The email body is sent as either 8bit or quoted-printable

Todo:
- Honour RFC 2047 for the encoding of the headers
- Use MIME encoding and other features for the other emails which presently are not affected by editparams.cgi
- Check whether we need to change something for 16-bit characters. I fear that MIME::QuotedPrint doesn't do the right thing in this case.
- Do some cleanup
- Testing: I haven't yet tested the changes between v9 and v10.

I'd be glad if someone could assure me that I'm on the right road.
burnus: if you disagree with my assessment of how this would most cleanly be implemented (as written in comment #22), could you at least say why? :-) Gerv
Attached patch v11/v1h: Encoding patch for HTML (obsolete) (deleted) — Splinter Review
I split the two areas, mail and HTML output. This only contains the changes needed for HTML and tries to address all issues given in comment #22. The only difference is that a "SendHTTPHeader()" is equivalent to SendHTTPHeader("Content-Type: text/html"). I think this patch is rather clean and independent of addressing the email encoding. Checking this in first reduces the size of the more complicated email patch. (Does someone know a lightweight Perl implementation for the encoding of email headers? I have a slight idea how to write it, but it is going to be ugly and lengthy :-()

> burnus: if you disagree with my assessment of how this would most cleanly be
> implemented (as written in comment #22), could you at least say why? :-)

Well, the reason is simple: I overlooked this comment :-(
Comment on attachment 72387 [details] [diff] [review]
v11/v1h: Encoding patch for HTML

>+# This sends a HTTP header
>+# It takes an list as argument and prints them \n separated

"a list as an argument" :-)

>+# If it finds "Content-Type: text/html" and the param "encoding" is set
>+# it adds the charsetencoding
>+# If called without an argument it assumes that "Content-Type: text/html" is
>+# ment.

"meant".

>+sub SendHTTPHeader(@){
>+  my $header = join("\n",@_);
>+  my $encoding = Param('encoding');
>+  if($header eq "") {
>+    $header = "Content-Type: text/html";
>+  }

$header ||= "Content-Type: text/html" is neater :-)

>+DefParam("encoding",
>+  "Character encoding used for the HTML documents. (This should match the encoding used by the database.)",
>+  "t",
>+  'iso-8859-1');

Please default this to nothing. See the long arguments in other bugs for the reason.

Other than those nits, r=gerv :-)

Gerv
Attachment #72387 - Flags: review+
Attached patch 72387: v12/v2h: Encoding patch for HTML (obsolete) (deleted) — Splinter Review
Fixed the issues which were raised in comment 29.
Comment on attachment 72860 [details] [diff] [review] 72387: v12/v2h: Encoding patch for HTML r=gerv. Gerv
Attachment #72860 - Flags: review+
*** Bug 129646 has been marked as a duplicate of this bug. ***
Bug 129643 contains another patch to fix the content-type issue; it removes the content-type prints rather than calling a function, and moves it into PutHeader. I am sorry I didn't see this before duplicating work. Just wanted to give a heads-up regarding this patch on this bug.
Attached patch Refresh of 72860 to apply to current HEAD (obsolete) (deleted) — Splinter Review
The 72860 patch has gone stale, and no longer applies cleanly to HEAD. This is mostly just a refresh. There are one or two new places, though, where people are putting additional fields in the header; those bits should be scrutinized for correctness by the responsible parties.
Also, shouldn't the target milestone for this bug be set to 2.16?
It's going to go stale again as soon as bug 84876 lands. Everything email-related is changing. If everyone insists this is a showstopper, I suppose we can put it in 2.16, but I certainly won't make it a blocker. If it gets done before we release, good; otherwise we'll have no qualms about releasing without it. Putting it in 2.18 for now... if it gets reviewed and checked in before then, we'll bump it up.
Target Milestone: --- → Bugzilla 2.18
It would also be good to set the charset on the output of xml.cgi; the current patch only sets the encoding on the HTML output from xml.cgi, which you get when no bug numbers have been specified. The output should probably specify the encoding in the <?xml?> PI at the start of the output:

  <?xml version="1.0" encoding="iso8859-1" standalone="yes"?>

I originally suggested this in the comments on bug 105960, but it was suggested that it was better to include it with this one. As it is, most XML parsers won't handle output like http://bugzilla.mozilla.org/xml.cgi?id=384, as it includes 8-bit characters but is not UTF-8 (XML files default to UTF-8 if no encoding is given).
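The parser failure described here is easy to reproduce with any standard XML library. A small Python demonstration (illustrative, unrelated to Bugzilla's Perl code): without an encoding declaration, the parser must assume UTF-8 and rejects a lone Latin-1 byte; with the declaration, the same bytes parse fine.

```python
import xml.etree.ElementTree as ET

# "café" encoded as ISO-8859-1: the é is a single 0xE9 byte,
# which is not valid UTF-8.
latin1_bytes = "<comment>caf\u00e9</comment>".encode("iso-8859-1")

# No encoding declaration -> parser assumes UTF-8 -> fatal error.
try:
    ET.fromstring(latin1_bytes)
    parsed = True
except ET.ParseError:
    parsed = False  # this branch is taken

# With the <?xml ... encoding=...?> declaration, the same bytes are fine.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_bytes
text = ET.fromstring(declared).text  # the original "caf\xe9"
```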
*** Bug 152190 has been marked as a duplicate of this bug. ***
Attached patch Refresh of patch 77394 to apply to 2_16-BRANCH (obsolete) (deleted) — Splinter Review
I've refreshed this again for the benefit of people who will be running 2.16, as well as for the benefit of anyone who wants to review it for inclusion in 2.17 when it opens (hint, hint).
Attachment #77394 - Attachment is obsolete: true
Blocks: 160096
*** Bug 160097 has been marked as a duplicate of this bug. ***
*** Bug 173227 has been marked as a duplicate of this bug. ***
Heads up, the default charset (for Mozilla?) in Redhat 8 is UTF-8 (unicode), so ISO-8859-1 data entered into b.m.o has been showing up incorrectly for users of that operating system. Depending on the browser support, we may want to default to UTF-8 instead of ISO-8859-1 for new installations. We should also have a recommendation for existing installations about how to migrate data from multiple other charsets into the one they want to use, if this is even possible.
Note that CGI.pm enforces sending a charset on text/* responses. When I mentioned that on IRC, timeless suggested that that was a bad idea, because it disables autodetection. This is particularly important for attachments, which may be testing something wrt autodetection in Mozilla. That patch also makes supporting this a one-liner in Bugzilla::CGI - just add a $self->charset(Param('charset')) call into the B::C constructor. We can't convert existing stuff unless we know what charset it currently is, and we don't.
> Note that CGI.pm enforces sending a charset on text/* responses. Can we work around that by sending a bogus charset? Mozilla might still autodetect if it doesn't recognise it. Gerv
I'd really prefer to fix CGI.pm...
Once I was visiting a site in UTF-8. Then I opened Bugzilla and entered a comment containing some accented characters. It was only when I received the mail back from Bugzilla that I realised my Mozilla was in UTF-8 while I was writing the comment, and therefore those accented characters turned out to be nonsense. I thus had to re-enter the comment. If Bugzilla had sent a content-type charset in the HTML header or HTTP header, this wouldn't have happened. Or, even better, Bugzilla could use UTF-8 by default.
Right, but I don't know how well browsers like ns4 deal with utf-8.
With the latest N4 I had used (4.76, I think), it wouldn't switch to UTF-8. But are there a lot of people using N4 to report Mozilla bugs? I don't know if Bugzilla keeps a record of connection statistics. If it does, it would be possible to know the percentage of charset-unaware browsers / total browsers (counting only the different kinds of browsers used per logged-in user, not the number of times they access Bugzilla). If this is less than 10%, I think it's safe to use UTF-8, because we can't wait forever for 0% to happen. Don't you agree?
Oh, and judging by the comments at the bottom of http://www.mysql.com/doc/en/Upgrading-from-3.23.html, MySQL doesn't support utf8 encoding. OTOH, I don't know if we need database support for this. It would be nice, and may make searching a bit easier, but it's not essential, I think.
I have no clue what the rate is, but we do need to support NS4. If the only side effect is that NS4 doesn't show non-ASCII characters, then I think we can deal with that - there's really no other option. A quick web search shows that some browsers do have issues with UTF-8 encoding, although it appears that they're OK with the ASCII (or maybe Latin-1) character sets. Hmm.

Has anyone got any suggestions for:

a) How to store this in the db (remembering that MySQL will then not work correctly on string-based operations with non-ASCII chars)
b) How to convert existing data (pgsql has code within the db to do conversions. We can use Encode, but only on Perl 5.8)
c) Whether supporting this fully should require Perl 5.8 (I really really really hope not)
d) What to do if we want character encoding X but the browser sends stuff which isn't valid for that (and how we detect that case)
e) Whether sending an (admin-defined) charset on all text/* documents will cause problems compared to the current no-charset setting
f) Whether we should allow an admin-defined charset, or just handle everything as UTF-8. This will probably make it much easier to deal with Postgres, although I don't know if DBD::Pg handles all that correctly - dkl?
g) Anything else?
It all depends on what you mean by support. I use MySQL 3.23.x to store UTF-8 encoded data: Chinese, Japanese, Korean, Arabic, and various Latin-script languages. There are a few things to keep in mind:

1. Use Latin-1 as the database encoding.
2. Your char columns should be BINARY, since you don't want the server doing case-insensitive string comparisons.
3. When doing wild-card searching and string manipulation you need to account for the fact that a single character may take four bytes (i.e., four Latin-1 characters).
4. Collation will give you Unicode order.
By 'support', I mean not having to do any of those things :) (2) is the main one here, which we need to avoid.
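Point 3 of the MySQL tips above — counting bytes rather than characters — is easy to see concretely. A quick Python illustration (any language with UTF-8 support shows the same numbers):

```python
# One "character" can be several bytes in UTF-8, which is why
# byte-oriented wildcard/substring operations in a UTF-8-unaware
# database must count bytes, not characters.
samples = {
    "A": 1,       # plain ASCII: 1 byte
    "\u00e9": 2,  # é, Latin-1 range: 2 bytes
    "\u4e2d": 3,  # 中, CJK: 3 bytes
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected
```

Characters outside the Basic Multilingual Plane take four bytes, the worst case the comment mentions.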
*** Bug 179076 has been marked as a duplicate of this bug. ***
The other thing to note is that we don't really use many db text functions, so the db support doesn't matter that much. If the db doesn't support it, then doing stuff like substring searches based on non-UTF-8 strings may return funny results. I'm personally OK with that... some of our quoting functions probably need to be updated to properly accept non-UTF-8 input, mind you. I've also changed my mind on making these columns 'BINARY', mainly because even though that is really likely to break lots of stuff, it's the way other dbs do things (because the SQL spec says so), so we'll have to deal with it eventually anyway.
Blocks: bz-russian
Blocks: 182975
Re comment 7: Dave, if the META tag is a problem, shouldn't it be removed from Bugzilla's *.html files? (bug_status.html, bugwritinghelp.html, confirmhelp.html, quicksearch*.html, votehelp.html) When using our local Bugzilla with Mozilla 1.2* and 1.3a, we got problems with wrong encodings and accidentally changed summary lines. Mozilla "randomly" flips the default charset between ISO-8859-1 and UTF-8 (bug 148369, bug 158285, bug 159295). A <meta http-equiv= ... "...; iso-8859-1"> tag in template/en/custom/global/header.html.tmpl worksforme as a quick fix.
Re comment 55: This can't be done safely. At least these files _are_ in ISO-8859-1. What about localized Bugzillas? What you suggest is a hack to work around Mozilla troubles, not a Bugzilla fix.
Right, which is why we should just say that everything is utf-8 and be done with? Its a simple standard natively supported by perl.
Re comment 56: No, the .html files are all plain ASCII, and should display correctly with any "ASCII-derived" charset (iso-8859-whatever, utf-8, ...). So if the mysterious NS4.x problem from comment 7 is an issue, the META tags should be removed (and possibly "AddCharset iso-8859-1 .html" added to .htaccess for localized .html files). Yes, the META tag with charset in "global/header.html.tmpl" is a necessary fix that makes localized Bugzillas usable with current Mozilla releases. And so it is also an intermediate fix for this Bugzilla bug 126266 ;-)

Re comment 57: I agree. But then there should be some support for converting existing localized Bugzilla installations to UTF-8.
As I understand it, we can't switch to UTF-8 until we drop support for NS 4.x. Isn't that right? Our current behaviour of not setting a charset is very useful because it allows people to just start using Bugzilla in their language, and browser auto-detect algorithms generally Do The Right Thing. I think Bugzilla should definitely continue to ship with no default charset. However, making it easier for admins to add one is perfectly reasonable, and this patch looks like the right idea (although it would need to add the charset to more Content-Types than just text/html.) Gerv
gerv: That works for HTML right up until two people with different charsets comment on a single bug. It also doesn't work for XML, which _must_ be given a charset. If we set utf-8, then any browser will Do The Right Thing. (At least any non-ancient one - I don't know how NS2 or IE2 will act, and I don't particularly care....) What problem does Netscape 4 have? The only comment in this bug is justdave's mention that we can't use <meta>, but have to use the Content-Type header. That's OK with me. I mentioned that I don't know if NS4 will work, and that's true. Local testing shows that it works for ASCII, though, and I'm not set up to try non-ASCII stuff. Using a single character set has the advantage that we can use the 'standard' Perl features on it. The problem with allowing an admin-settable charset is that we have no way of testing if inputted data is correct. With utf-8, we can use Perl 5.8's native stuff, or simulate it under 5.6.
As a test, consider http://www.unicode.org/iuc/iuc10/x-utf8.html. I get missing fonts (which display with ? in ns4, and a glyph for the 4 digit codepoint under mozilla), but the stuff which I do have fonts for do display correctly.
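The validation point made two comments up — that with a fixed charset of UTF-8 we *can* check whether inputted data is correct — reduces to a well-formedness test. A minimal sketch in Python (Bugzilla itself is Perl, where perl 5.8's Encode provides the equivalent; the function name is illustrative):

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True iff the submitted bytes are well-formed UTF-8.

    No such check is possible for an arbitrary admin-chosen 8-bit
    charset, since e.g. any byte sequence is "valid" Latin-1.
    """
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A form handler could reject or re-encode submissions that fail this check instead of storing mojibake.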
I vote for universal UTF-8 encoding. This would be a simple, safe and everlasting solution.
UTF-8 everywhere would be nice. The main thing I would like to see is xml.cgi producing formal XML output. Currently it doesn't set a charset in the <?xml?> line, so standard XML parsing tools treat it as UTF-8. This causes big problems for some bugs where non UTF-8 8-bit characters have been used, making the xml parser return an error. Being able to use standard XML tools with bugzilla would be a very useful feature ...
Right, but we have to set some charset, and the current problem is that we don't have one to use...
I was just pointing out xml.cgi as a reason why it is worth worrying about the charset issue. This is a case where we can't just leave it to the web browser's charset detection heuristics (an XML parser is required to treat the current output of xml.cgi as UTF-8, and then fail when it finds invalid UTF-8 data ...). The two solutions are to allow the bugzilla administrator to set the charset (in which case this setting should be used in xml.cgi's output as well), or decide on a fixed encoding such as UTF-8. Since a lot of people are moving toward UTF-8, the second option is the one I would prefer (even though it is probably more work in the short term).
I can imagine two possible scenarios:

a) Allow the administrator to set the character encoding he likes. This would mean caring about the encoding everywhere in the code, and I would expect some work to be done now and many bugs to appear in the future (because every programmer would have to think about the charset).

b) Set one fixed encoding, UTF-8, everywhere. This might initially be about the same amount of work as the previous case, but later on I would expect a minimum of bugs resulting from forgetting about the encoding.

That is why I prefer to use fixed UTF-8.
You don't have to convince *me* :) That said, I have other Bugzilla stuff I'd prefer to do, so if someone wants to take this, feel free. Any patch needs to come with a script which can convert existing content from at least ascii, utf-8 (i.e. no change), iso-8859-{1,1+euro}, and ISO-JP-{Mumble}. Anyone who has content in another encoding can patch the script - I think that's likely to cover most of the detectable content. It needs to work on at least Perl 5.6.1. Making it work on Perl 5.6.0 would probably avoid lots of other arguments, but I'm not sure if that had the required support we need.
It may be useful to know that I believe Simon Cozens has written a number of charset-conversion Perl modules, which you should investigate when writing any conversion scripts. Gerv
Yep, and Encode is standard with 5.8. Problem is that it requires 5.7.1, so.... Maybe we could have the script require 5.8 - it could easily be run offline, and generate sql UPDATE statements rather than modify stuff directly.
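The offline converter described here — decode each stored comment from its legacy charset, re-encode as UTF-8, and emit SQL UPDATE statements rather than modify the database directly — can be sketched as follows. This is an illustrative Python sketch (the thread proposes Perl with Encode); the table and column names and the naive quoting are assumptions, not Bugzilla's actual conversion script.

```python
def to_update_statement(comment_id, raw, source_charset="iso-8859-1"):
    """Emit one SQL UPDATE converting a legacy-encoded comment to UTF-8.

    raw            -- the comment text as stored (bytes in some legacy charset)
    source_charset -- detected/assumed charset of those bytes

    Table/column names (longdescs.thetext) and the '' quoting are
    illustrative only; a real script would use proper SQL escaping.
    """
    text = raw.decode(source_charset)   # legacy bytes -> Unicode
    escaped = text.replace("'", "''")   # naive SQL string quoting
    return ("UPDATE longdescs SET thetext = '%s' WHERE comment_id = %d;"
            % (escaped, comment_id))
```

Writing the statements to a file keeps the conversion reviewable and rerunnable, which is the point of generating UPDATEs instead of touching rows in place.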
Has anyone addressed the issue of how we deal with existing data in many different encodings in the current Bugzilla? (Maybe this is not the bug for dealing with that?) I think I said this before, but if we are going to move to UTF-8 in Bugzilla, then it needs to be done only for new data. The old data should be sent without any charset info. Strategically, it would be something like this:

1. Mark a transition date.
2. After that date, all data would be handled as UTF-8, with input and output processes accurately reflecting UTF-8.
3. Any data prior to that date should not bear any charset info.

By the way, I don't understand all the comments about Comm 4.x. It certainly supports UTF-8. It does not have the automatic font/glyph finding mechanism found in Netscape 6/7 and IE. What you need to do with Comm 4.x is have users set a font that supports the characters you want to display under Unicode: Edit | Prefs | Appearance | Fonts | For the encoding: Unicode, then pick some fonts that cover a lot of different languages' characters, like Arial Unicode (for Win). CJK users can simply choose their native fonts, and that should work for most situations.
momoi: we can't do that, because comments for the same bug can happen both before and after. What we need to do is have a script go through every comment and try to autodetect the charset.
See bug 182975 comment 3 for a previous mention of problems with NS4's form submission when using UTF-8. Not sure if that's covered by the workaround Kat mentioned. As for transition, I agree with Brad, we should convert all of the existing data to UTF-8 as best as possible as part of the upgrade once this is included. FYI, MySQL does not appear to support Unicode in any shape or form yet. According to their website, Unicode support is planned for version 4.1 (which isn't out yet). Sybase and Postgres, which are the other two databases we're on the verge of supporting, both do, however.
Alias: bz-charset
Keywords: patch, review
OS: Linux → All
Priority: -- → P1
Hardware: PC → All
What about using the OS/user's default charset? For example (assuming Solaris):
- if the default $LANG is C/POSIX or iso8859-1, we default to iso8859-1
- if the default $LANG is ja_JP.UTF-8, we default to UTF-8
etc.
Regarding comment 73, selecting the database character set based on the locale of the user running the Bugzilla instance or the MySQL instance is not going to work: it is quite likely that a server running Bugzilla lacks the locale setting of interest. Further, this is a rather advanced setting: most people outside of Eastern Europe and Asia will be happy with iso8859-1, and people in Eastern Europe and Asia are already aware of the character encoding used on their particular installation, out of necessity. The real problem is when you have a database that contains rows in multiple character sets, where the character set used for each row is not tagged. Generally this shouldn't happen with Bugzilla, and each database should be internally consistent; of course I can imagine scenarios where it could actually occur. In any event, the approach to take is to export the entire database to text, transcode the rows as appropriate, and import the database back. This is the approach recommended by Oracle through 9i, and was the method used by Amazon.com (for example) when they moved their databases to UTF-8.
I think I will agree on forcing the use of UTF-8. I would appreciate detailed guidance on how I could implement the said solution in my environment. Below is a description of my environment:

1) Debian GNU/Linux 2.2.20-idepci ---> from #uname
2) Debian package of Bugzilla (ver 2.14.2), sendmail and others
3) We changed it to Content-Type: text/html; charset=euc-jp under /usr/lib/cgi-bin/bugzilla

If there is anything you would like to clarify, please let me know. Please consider me a newbie trying to find my way around this kind of huge system. Domo Arigato Gozaimasu (Thank you very much in Japanese)
*** Bug 188745 has been marked as a duplicate of this bug. ***
Comment on attachment 72387 [details] [diff] [review] v11/v1h: Encoding patch for HTML one year worth of bitrot...
Attachment #72387 - Flags: review-
Comment on attachment 72860 [details] [diff] [review] 72387: v12/v2h: Encoding patch for HTML one year worth of bitrot...
Attachment #72860 - Flags: review-
Anyone have an up-to-date patch for this? This would be really cool to get in fairly soon, even if it's optional and only available if you're using Sybase or Postgres (since MySQL doesn't natively support UTF-8).
Nothing stops you from storing UTF-8 in a MySQL database: I do it regularly and it works fine. Supposedly MySQL 4.1 will include unicode support, but in the meanwhile it would be nice to have this fixed so those of us using Unicode in MySQL can get our users off our backs complaining about busted subject lines in email. ;-)
Mozilla should render bug pages in standards mode, but it can't until this is fixed. Bug 38856 supposedly fixed this.
The charset encoding has nothing to do with standards mode.
According to my understanding of the pre- and post-filing discussion of bug 196292, the absence of a charset declaration puts Mozilla in quirks mode.
'char' doesn't appear on that page. The official description is http://www.mozilla.org/docs/web-developer/quirks/doctypes.html - it goes on doctype, not charset.
And from the page Bradley mentioned, it says that a page will render in quirks mode if it uses "The public identifier "-//W3C//DTD HTML 4.01 Transitional//EN", without a system identifier.", which is why bug pages render in quirks mode. It is not related to the charset.
Hmm. I thought we had a system identifier. Oh well. Did mozilla's behaviour change at some point, btw? The <img> in the <table> in the header got displayed differently after we added in the doctype way back whenever it was we templatised.
Just confirming that comment 83 is incorrect.
*** Bug 202114 has been marked as a duplicate of this bug. ***
*** Bug 174340 has been marked as a duplicate of this bug. ***
For the record: adding an encoding to the Content-Type header is now easier since bug 201816 has landed. (A short usage note is found in the documentation after bug 201955 is in.) This doesn't solve the general problem though (ISO-8859-1 vs. UTF-8 vs. other encodings), and using correctly encoded mail headers ('To' and 'Subject') and mail bodies is still to be done.
It also doesn't actually make any of the data UTF-8 (you need tags on the <form> for that), nor does it handle data currently in the system which is not UTF-8. There's also validation (which will probably mean patches to CGI.pm), and some other stuff too.
*** Bug 207960 has been marked as a duplicate of this bug. ***
Blocks: 135762
*** Bug 213864 has been marked as a duplicate of this bug. ***
*** Bug 219257 has been marked as a duplicate of this bug. ***
*** Bug 220066 has been marked as a duplicate of this bug. ***
*** Bug 221838 has been marked as a duplicate of this bug. ***
The problem still exists in 2.16.3, where the HTML pages use UTF-8 encoding but email goes out without the proper MIME header. The problem, however, is not as simple as adding a header: several MUAs (Mail User Agents) cannot deal with UTF encodings yet, so the preferable solution would be _reencoding_ the binary message before sending it as email. For email, the ISO Latin charsets are preferable, for Europe at least, and the encoding should be quoted-printable. Is there a fix for the stable version already?
There are also several MUAs that can properly handle UTF-8 mail. The best solution is probably to fix those that don't, instead of introducing a reencoding workaround that reintroduces the original problem.
mutt allows the user to specify a list of preferred character encodings for sending email. If the message can be encoded in the first encoding in the list, it is used, then the second, etc. (IIRC, the default value for that preference is us-ascii, iso-8859-1, UTF-8.) A similar solution would likely work well in Bugzilla.
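The fallback scheme described above is easy to sketch. This is an illustrative Python sketch, not mutt or Bugzilla code; the charset list is just the default mentioned in the comment:

```python
# Hypothetical sketch: pick the first charset in a preference list that
# can represent the whole message, falling back to UTF-8.
PREFERRED = ["us-ascii", "iso-8859-1", "utf-8"]

def pick_charset(text, candidates=PREFERRED):
    for charset in candidates:
        try:
            text.encode(charset)  # succeeds only if every char is representable
            return charset
        except UnicodeEncodeError:
            continue
    return "utf-8"  # UTF-8 can encode any Unicode string

print(pick_charset("plain ascii"))  # us-ascii
print(pick_charset("Göthe"))        # iso-8859-1
print(pick_charset("日本語"))        # utf-8
```

Because UTF-8 sits last in the list, it is only used when no more widely supported charset suffices.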
With regard to comment 100, the user needs to be aware that some encodings are indistinguishable from each other without out-of-band information: all of the ISO 8859-x encodings share the same encoding space, as do EUC-KR and EUC-CN, even though they have very different character sets associated with them. What does mutt do when the text it wants to send cannot be transcoded to a character set/encoding in the user's list? Latin-1 cannot be transcoded to Shift-JIS, for example, without losing the accented characters. And is the system aware of differences within character sets with a single name? On Windows, GB2312 implies CP936, which has some 18K more characters than pure GB2312 as would be found on Unix systems (in EUC-CN, for example). And I assume that, at least internally, mutt is pivoting through Unicode for this? Otherwise you end up with n**2 tables for n different encodings...
I don't see the point of your out-of-band information comment. Both email and web pages can and should contain encoding information. I don't know what mutt does if the message can't be encoded in any of the encodings. It would only be an issue if utf-8 were removed from the list. Internally, mutt probably uses the relevant library functions such as iconv, which I'm sure go through Unicode internally in some sense or another.
My comment about out-of-band data is that without knowing the language of the comment, you cannot necessarily know which of the 8859-x encodings is correct. If the comment is in Russian then you can trivially convert it to 8859-1 or 8859-2 or 8859-6 or whatever, and not get mapping errors. Similarly, a comment in Chinese in EUC-CN can be transcoded to EUC-KR without an error, but resulting in absolute garbage. Finally, if all of the comments are in Unicode you need to know the language of the comment so you can pick the most appropriate legacy encoding to transcode to without trying them all... unless you only limit yourself to those the user indicates they can process. FWIW, I think giving the user a choice of encoding to transcode mail to prior to sending is a fine idea... it just needs to be handled the right way because you can get burned very quickly.
I agree with comment #99. Bugzilla can't take care of everyone who still lives in the stone age and uses broken email clients like Eudora. Nonetheless, a reasonable fallback (as suggested) can be supported. As for mutt, it does use iconv(3), and Unicode is the internal representation. re: comment #101 Once all textual data in bugzilla is converted to Unicode (at least for bugs filed after D-day, or bugs whose comments don't include non-ASCII characters up to D-day), we exclusively deal with Unicode data, so there's no issue with the indistinguishability of legacy encodings. re: comment #100 It'd be great if the prioritized list of encodings could be configurable per user. Or, at least, there should be an option to get emails exclusively in UTF-8. I don't want to receive bug mails in ISO-8859-1 (although I don't have any problem dealing with them with a patched version of Pine that takes advantage of iconv(3)).
Re comment 103: as comment 104 says, there should never be any guessing involved, since we should know the encoding of all data.
My personal preference is to use ASCII by default (for email) and fall back on UTF-8 if there are any non-ASCII characters present. Don't even mess with ISO-8859-1. All web interaction will always be UTF-8. Eudora 6 (for the Mac anyway) deals with Unicode just fine. 6.0 was the first version that did, however. 5.x had problems with it in subject lines (it'd just display the raw =?UTF-8?B?foobarbaz?= on the subject lines) but dealt with it fine in message bodies. Anyone still using 4.x or less needs to upgrade. :) Seriously, anyone using an email client old enough to not support that kind of stuff is pretty unlikely to be dealing with bugs that require it. And everyone supports ASCII :) If someone is regularly dealing with internationalized data, then they need to upgrade their software to handle it.
> My personal preference is to use ASCII by default (for email) and fall back on > UTF-8 if there are any non-ASCII characters present. How could you tell the difference between doing that, and just using UTF-8 exclusively? (Other than by looking at the headers, but that would only matter if the UA did so itself, and if it does, then it's unlikely not to support UTF-8.)
My comments about needing to know the language so you can reliably transcode to legacy encodings only apply if we adopt the suggestion in comment 100 that Bugzilla allow the user to specify a list of encodings they are willing to accept. Encoding cannot be trusted to tell you what language a piece of text is in. I can write perfectly normal German in 7-bit US-ASCII. Unicode doesn't help at all, since you lose all language-related information. You cannot reliably determine language based on what block a character or set of characters comes from, except in relatively rare circumstances. Knowledge of the language is required to reliably transcode to a legacy character set. If all data is in UTF-8, and no transcoding is required, then nothing needs to be done: US-ASCII just works. As soon as you start transcoding between encodings you need to know the language to do it right.
For the record, Zippy's Bugzilla has been in production for almost 4 months now with utf-8 specified as the charset in the headers using Perl 5.6.1 and we've had no issues. Note that it was a new installation that started from scratch with utf-8, and not applying it to any legacy data. I think it would be a piece of cake to make new Bugzilla installations use utf-8. Upgrading existing ones is going to be a can of worms though.
How about a feature to use UTF-8 for all bugs numbered <configurable number here> and up? That would at least allow a transition going forward.
> Knowledge of language is required to reliably transcode to a legacy character > set. Why?
re comment #110: That's what I was saying in comment #104, and what Markus Kuhn suggested more than a year ago in another bug. In addition to new bugs (with bug # > N), old bugs with pure ASCII data as of a certain D-day can be 'converted' to UTF-8. To do so, we need to add a boolean field to each bug to indicate whether it's carrying UTF-8 data (because the test on bug # doesn't work for them). re comment #106 and Eudora: I meant to write about Eudora on Windows. For an unknown reason, Eudora-Mac's I18N support has always been ahead of Eudora-Win's. I believe Eudora-Win still doesn't support UTF-8. Neither does it allow users to choose a character encoding for outgoing emails, and there's no support for RFC 2047 header decoding (I'm not sure about the latest version). BTW, there's one more thing to take care of in bugzilla's migration to a UTF-8-only world. I noticed that some Western Europeans had entered their names in their bugzilla accounts in ISO-8859-1 [1] (I have never seen non-Western Europeans enter their names in the corresponding legacy encoding). Perhaps bugzilla-admin could scan the account name field to see if there's any character outside US-ASCII. If there is, send an email asking account owners to reenter their names with View|Character Coding set to UTF-8. [1] That's also the case for some xml/xul files (that are supposed to be in UTF-8) in the mozilla source tree.
We can convert existing data to UTF-8 with a reasonable degree of accuracy. We can't get it perfectly correct, of course, but we should be able to come up with something which is close, and which an admin could script given the requirements of their userbase.
No, I wouldn't dare. In some bugs, a few different encodings are used that are all but impossible to distinguish from each other without human intervention, especially given that the comments are usually pretty short, which keeps us from using any charset detection based on statistics.
This would be done on a per-comment basis, not a per-bug basis. If you mix encodings within a single comment, there's not much anyone can do.
How would you reliably determine the encoding on a per-comment basis without human intervention? That's the whole point of Tom's comments. Mozilla's charset detector and Basistech's similar product sometimes fail even for much longer chunks of text than the usual length of bugzilla comments. Even 95% or 99% is not good enough (not that I think you can reach even 90%) for our purpose. Just leaving them alone is better if you can't get 100%.
If a single bug uses multiple character sets, then it doesn't matter if we screw them up, since no UA is going to ever show all the comments right at the same time anyway.
> it doesn't matter if we screw them up, It does matter. If we just leave old bugs with non-ASCII comments *alone*, we can change the encoding manually to view them correctly (although not all comments correctly at the same time, if multiple encodings are used in a single bug). However, if we screw them up by incorrect detection, it's a lot harder to view them: we have to figure out not only the original encoding but also the wrong encoding that the charset detector believed the comments to be in. Therefore, I stand by my comment #112 (dbaron's comment #110). If there is someone who's got an infinite amount of free time, (s)he may go comment by comment and convert them to UTF-8 manually in the backend DB.
Commercial encoding/language detectors can often do quite well with as little as 96 bytes: our encoding/language detector can achieve 98-99% accuracy in its first guess on a buffer of this length, and is almost 100% accurate within the top two or three guesses with 64 bytes or more. In these cases you can take the top three candidates, convert from the hypothesized encoding to Unicode, and see how many invalid characters you get in the conversion. The one with the fewest invalid characters wins; if there is a tie, you take the first guess. This can work pretty well, and has worked well for companies migrating gigabyte-size (and larger) databases. However, these kinds of tools are expensive and not readily available: open-source converters do not approach this level of accuracy. Detection within a BZ database is also complicated by the fact that comments may include source code or similar 'noise' that needs to be accounted for. In any event, as Jungshik says, multiple encodings can be displayed by the user manually switching character encodings on the page. I expect this is an issue for those using Cyrillic (Russian, Ukrainian), where there are multiple competing encodings in regular use. In any event, this bug isn't the place to discuss UTF-8 migration issues.
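The "fewest invalid characters wins" step described above can be sketched in a few lines. This is an illustrative Python sketch, not any shipping detector; a real detector would rank the candidates statistically first:

```python
def best_candidate(raw, candidates):
    """Decode the raw bytes with each candidate encoding and keep the one
    producing the fewest U+FFFD replacement characters. min() is stable,
    so on a tie the earlier (higher-ranked) guess wins."""
    def invalid_count(enc):
        return raw.decode(enc, errors="replace").count("\ufffd")
    return min(candidates, key=invalid_count)

# Latin-1 bytes are not valid UTF-8, so the Latin-1 hypothesis wins:
print(best_candidate("café".encode("iso-8859-1"), ["utf-8", "iso-8859-1"]))
# Valid UTF-8 decodes cleanly under both, so the first (tied) guess wins:
print(best_candidate("café".encode("utf-8"), ["utf-8", "iso-8859-1"]))
```

Note the limitation the surrounding comments keep raising: Latin-1 never produces an invalid character for any byte sequence, so it can only ever lose ties, never win on evidence.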
*** Bug 225291 has been marked as a duplicate of this bug. ***
If by comment 119 we're not going to discuss migration here (hooray), what stops us from simply adding a charset Param() and getting on with the migration problems when we try to apply it to existing Bugzillas?
*** Bug 226941 has been marked as a duplicate of this bug. ***
re: comment 121: I agree. Let's get on with it. For this to go in right now, we need the following behaviour: - This is implemented as a Param. - For new installations, the Param defaults to "utf-8". - For upgrading installations just picking up this param, it defaults to being disabled. That will get around the problems caused by existing installations having to migrate data, since it won't force them to migrate when this goes in. We can then open another bug for migration paths, and encourage people to submit theirs if they come up with one.
Regarding the patch in attachment 88599 [details] [diff] [review]: Is there a significant reason to write $header =~ s#^Content-Type: text/html$#Content-Type: text/html; charset=$encoding#i; instead of $header .= "; charset=$encoding"; #?
Blocks: 229010
attachment 88599 [details] [diff] [review] doesn't seem to take care of 'bug mail'. If 'UTF-8' (or any other charset) is set in a new installation, bug mails should have [1] Content-Type: text/plain; charset=CHARSET Content-Transfer-Encoding: 8bit MIME-Version: 1.0 Currently, bug mails have none of the above. When the C-T and C-T-E headers are missing, that is treated as the RFC 822/RFC 2822 default, which is equivalent to Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit MIME-Version: 1.0 In addition, header fields (e.g. Subject, From, etc.) should be encoded per RFC 2047. [2] That is, you can't just send out raw 8-bit characters in the message header. Instead, they have to be encoded as follows: Subject: =?UTF-8?B?.........?= blahblah =?UTF-8?B?.....?= Subject: =?ISO-8859-1?Q?His=20name=20is=20G=F6the?= [1] http://www.faqs.org/rfcs/rfc2822.html http://www.faqs.org/rfcs/rfc2045.html [2] http://www.faqs.org/rfcs/rfc2047.html
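For illustration only, Python's standard email.header module performs exactly this kind of RFC 2047 encoding (Bugzilla itself is Perl and would use a module such as MIME-Tools):

```python
from email.header import Header, decode_header

# Encode a Subject containing non-ASCII characters as an RFC 2047
# encoded-word, then round-trip it to confirm the decoding.
encoded = Header("His name is Göthe", charset="iso-8859-1").encode()
print(encoded)  # an =?iso-8859-1?q?...?= encoded-word

raw, charset = decode_header(encoded)[0]
print(raw.decode(charset))  # His name is Göthe
```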
Just in case it's not known, MIME-Tools at http://www.zeegee.com/code/perl/MIME-tools/ can make it easy to deal with RFC 2047 header encoding as well as other MIME-related issues. Encode module would be handy when we finally decide to migrate. http://www.cpan.org/modules/by-category/13_Internationalization_Locale/Encode/ Well, these days, there are lots of encoding converters and Mozilla has one if you build intl/uconv/tests, but given that a large part of bugzilla is written in Perl, Encode may have some advantages.
Sorry for spamming. I hadn't read comment #27. It has the following: > Default email setting uses MIME with %encoding% and %transportencoding% > The email body is either sent as 8bit or quoted-printable There should be an option to use a C-T-E of base64 for text/*. Another option might be necessary to pick whichever is shorter after calculating the length of the encoded result. There's a myth that QP is for text/* and Base64 is for binary (image/*, audio/*, etc). That's wrong. For CJK, Russian, Greek, Thai and other non-Western-European text, Base64 is more space-efficient and is not much worse than QP in terms of 'human readability'. For such text, '=A1=B0=C0!=20"=B1=AC' would be as cryptic as 'xerTgylkRt' if a non-MIME-aware client is used. The same is true of the Q encoding and the B encoding in RFC 2047-style header encoding. > Honour RFC 2047 for the encoding of the header Yes, this is important and easy to do with MIME-Tools. > Check whether we need to change something for 16bit characters. > I fear that MIME:QuotedPrint doesn't do the right thing in this case. If '16bit characters' means supporting UTF-16(LE|BE), I guess you don't have to worry. RFC (2)822 email messages are byte-oriented, so I don't think it's possible to send non-byte-oriented data in text/* messages no matter how we encode it. If your concern is that MIME:QuotedPrint splits the multiple octets representing a single character (in multibyte encodings such as UTF-8, GB2312, Big5, EUC-KR, ISO-2022-JP) across two neighboring encoded words, that's indeed a problem. My memory is not clear, but the last time I checked, it did the right thing.
jshin: I just read your latest bugmail using |less| as I have read my last several thousand bugmails. base64 means I can't use less. That would be a showstopper for me. If we're going to do base64, I'd have to request that users be able to pref against it. I don't care if I can't read CJK bugmail content, I need to be able to quickly scan for bug status changes and attachment creation and flag changes. Note that if I were to actually visit bugzilla.mozilla.co.jp or whatever it is and it were a current bugzilla (say 2.19.1), I'd request that my interface be English instead of the Japanese default. Which means I'd expect to see "Created an attachment" and: jshin@mailaps.org changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |jshin@mailaps.org and ... Even though the actual bugzilla might store Japanese internally. Actually, it probably shouldn't really store strings for this stuff, since I should be able to get those strings in Spanish when I load the bug later.
If the bugmail required base64, there's a 99% chance you wouldn't be able to read it anyway, even if it was ascii+garbage, because it would all be garbage, so you're not losing anything. The trick is you don't convert it to base64 unless it contains 8-bit data. It should be fairly easy to get a percentage of 8-bit data in the mail to be sent... 0% = use us-ascii. < 30% = use Quoted Printable. >30% = use base64. The way we have it set up on Zippy's bugzilla right now it just ships raw 8-bit data in the body of the email (C-T-E: 8bit), but we mime-encode the headers (Subject in particular) using MIME::Words if and only if they actually contain 8-bit data. We probably should use QP or base64 on the content, but in our case it didn't matter because the mail systems between the Bugzilla and the recipients (who were all internal folks) were all 8-bit capable.
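The threshold heuristic sketched above could look like this. The 30% cut-off and the labels are just the values suggested in the comment, not anything Bugzilla implements (illustrative Python; Bugzilla itself is Perl):

```python
def pick_cte(body):
    """Choose a Content-Transfer-Encoding from the share of 8-bit bytes:
    0% -> plain 7bit, under 30% -> quoted-printable, otherwise base64."""
    if not body:
        return "7bit"
    ratio = sum(b > 0x7F for b in body) / len(body)
    if ratio == 0:
        return "7bit"
    return "quoted-printable" if ratio < 0.30 else "base64"

print(pick_cte(b"plain text"))                               # 7bit
print(pick_cte("mostly ascii, one \xe9".encode("latin-1")))  # quoted-printable
print(pick_cte("日本語のコメント".encode("utf-8")))            # base64
```

This keeps mostly-ASCII bugmail greppable in QP while reserving base64 for bodies that would be unreadable raw anyway, which is the trade-off the thread is debating.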
The argument for always using quoted-printable is the following: suppose someone who can't deal with base64 gets bugmail containing a long comment in a language that person can't read. He doesn't really care about the comment (or is forced not to, since he can't read it), but he does want to know whether the bug was reassigned or its target milestone was changed. If the message is encoded using quoted-printable, he'll see reassignment or target milestone changes. If it's base64, he won't.
Well, I'm not a big fan of Base64/QP; they're just necessary evils. Most SMTP 'transports' are 8-bit clean these days, but we can't be sure, which is why we need the ESMTP negotiation mechanism (unfortunately, not many MTAs/MUAs do that). Even if bugzilla sends out bug mails with an 8bit C-T-E, an SMTP server somewhere along the way can turn 8BITMIME into Base64 or QP if it determines that its peer doesn't support the 8BITMIME ESMTP extension (see RFC 1652; sendmail 8.x does that by default). All I want is that the door should not be shut on base64 if somebody wants it. As for using 'less' and non-MIME-aware email clients, QP is certainly better than Base64, but still not as convenient as 8bit. If you really care about it, you'd better set up a local procmail filter that converts incoming emails to 8BITMIME automatically. You can also filter your existing mailboxes through the procmail filter. I've done that for years because I also want to run 'grep', 'less' and friends on my mailboxes. Actually, I stopped doing it (for single-part text/* messages) when sendmail 8.x began to support automatic decoding of base64/qp to 8BITMIME before delivering incoming emails to mboxes. If you don't control your MTA and can't run procmail on incoming emails, you can still filter your local mailboxes (on disk) through it. Please don't say everybody doesn't know how to do that: if she uses grep, less, etc. to look for something in her mailbox, she certainly can (and there are other tools that let you do the equivalent).
Blocks: 110692
A simple question: why isn't the trivial fix (of adding "; charset=UNICODE-1-1-UTF-8" to "Content-Type: text/html") applied for 2.16.5?
Ulrich: the Bugzilla Guide outlines why Bugzilla doesn't ship with a default charset. On a separate note, is it a Mozilla bug that this page now has fonts double the size, and complains about Chinese text display? Or is it one of the comments? Gerv
(In reply to comment #132) > Simple question why isn't the trivial fix (of adding "; Please, go through comments in this bug. > charset=UNICODE-1-1-UTF-8" to "Content-Type: text/html") applied for 2.16.5? The preferred MIME name is not 'UNICODE-1-1-UTF-8' but 'UTF-8'
It is also necessary to add an attribute like accept-encoding="UTF-8" to all the forms if you want web browsers to send data back in unicode. Without that, they may use the locale charset instead (even if the page is encoded in UTF-8).
IMO, I think it should be "accept-charset" instead of "accept-encoding". http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset
:: sigh :: :) This is very high on my list, btw, we'll have this in 2.20 or bust. :)
Whiteboard: i18n
Target Milestone: Bugzilla 2.18 → Bugzilla 2.20
BTW, improperly encoded headers are one criterion filters like amavisd-new use to detect spam. This means if you use strict filters they'll also eat some of your bugzilla traffic (this is bad). I vote for UTF-8 everywhere, with database conversion on upgrade. Other encodings might work better with old tools, but variable encodings cause so many problems everywhere it's not even fun to write about them here. One encoding to rule them all, period. Screw ancient tools - if they can't grok UTF-8, they need massive amounts of handholding everywhere to work anyway.
(In reply to comment #138) > I vote for UTF-8 everywhere with database conversion on upgrade. And therein lies the problem. As stated in earlier comments here, conversion of an existing database is *almost* impossible, because we have no reliable and accurate way to tell what the existing character sets (note plural) are. Making this all work on a clean install is a cakewalk. Upgrading existing systems is a nightmare.
Summary: Allow administrator to set charset encoding for pages and email → Use UTF-8 (Unicode) charset encoding for pages and email
As an administrator of a db that would need conversion, I can tell you incomplete conversion is the lesser evil here. Undefined encoding is far worse even short-term (since more and more clients do use UTF-8, so you end up with mixed encodings anyway). Just treat all data as iso-8859-15 and convert to UTF-8 so there are no invalid Unicode combinations left; that's all I ask. Some bugs will be mangled wrongly, but they are already mangled in some circumstances *NOW*, when consulted by people with a different client than the original submitter.
If we could vote, I would also vote for using UTF-8. Come on, we're in the 21st century, 2004 to be exact, and the encoding problem is still unsolved! When the database is upgraded, is it possible to keep the old data? Normally, yes, right? In this case, there's one solution, but it might take quite long to implement: when someone notices that a comment can't be displayed correctly, he could click a link to mark it as badly converted. You know, like the [reply] link that was absent before some version (I don't know which exactly). In order to avoid sabotage, we could only allow the owner of the comment to do this. One step further (even longer to implement): he could be directed to a new page where the comment is fetched from the *old* database and is displayed with a default charset, Latin-1, of course. Then, there's a selector with different encodings. The user chooses (or tries) an encoding, and the selector triggers a submit to the server. This time, the server sends back the comment in the requested encoding (actually, the server's job is just writing a different meta tag in the HTML/HTTP header). When the user is sure of the encoding, he pushes the submit button, and the server, with the chosen encoding, converts the old comment to UTF-8 and puts it in the database, replacing the wrong one.
There is no reason to "vote" to use UTF-8. It is pretty clear that that is the preferred option of the bugzilla developers (just look at the title of the bug). The problem is how to get from here (encodings not specified, database may contain comments in a variety of different encodings) to there (all data in UTF-8). Given that all comments for a bug are displayed on a single page, and people are going to want to comment on bugs filed before any switch to a fixed encoding, it will be necessary to convert the old data (having a new bugs vs. old bugs distinction won't work, and neither will separate databases/installations). The conversion process will probably need to be customisable for a particular installation, though. While treating existing data as ISO-8859-1 or ISO-8859-15 might work for most English or European installations, it would be incorrect for a Japanese bugzilla installation. Probably the right process is to check if each string is valid UTF-8. If it is, leave it. If it isn't, convert it from the encoding the administrator thinks most of their content is in. If there are other encodings where you can validate a string like you can with UTF-8, the conversion process could be more sophisticated. Unfortunately, you can't easily check whether a string is actually ISO-8859-1 or not, because pretty much any 8-bit string could be valid ISO-8859-1 (strictly speaking, strings containing characters in the range 0x80 - 0x9F aren't valid ISO-8859-1, but many Windows boxes will send back strings in Windows-1252 encoding when asked for ISO-8859-1).
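The per-string rule proposed above (keep valid UTF-8, otherwise convert from an admin-chosen encoding) can be sketched as follows. Python here purely for illustration, with ISO-8859-1 as a hypothetical default fallback:

```python
def to_utf8(raw, fallback="iso-8859-1"):
    """Leave strings that already decode as UTF-8 alone; otherwise
    reinterpret them in the admin-chosen fallback encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

print(to_utf8("déjà vu".encode("utf-8")))       # already UTF-8: kept as-is
print(to_utf8("déjà vu".encode("iso-8859-1")))  # reinterpreted as Latin-1
```

As the comment warns, this is only a heuristic: nearly any byte string decodes "successfully" as ISO-8859-1, and some short legacy-encoded strings happen to be valid UTF-8, so it can mislabel data.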
Here's my suggestion (which probably doesn't say anything new). All Bugzillas have a "charset" param, used in all the appropriate places. All new Bugzillas have this set to "UTF-8". Whenever someone upgrades a Bugzilla from pre-this-change to post-this-change, checksetup asks them for a charset. If they specify one, Bugzilla converts all comments from that charset to UTF-8, and uses UTF-8 in the future. The list of available charsets may be limited, if each one requires explicit support. They also have the option of specifying no charset. If they do that, the param is set to "", and Bugzilla continues to send no charset (i.e. it works exactly as now.) UTF-8 and "" are the only two valid values for the charset after the upgrade. I'm sure that misses something. What? :-) Gerv
For the record, there appear to be at least 15 unique character sets in use in bug data on bugzilla.mozilla.org. Picking "one" to convert from won't even come close to working. The best option I can think of is to tell the browser that everything is UTF-8 and provide a "re-encode me" link next to items which we can detect aren't UTF-8, for people who have specific privileges. Clicking that link would load the item all by itself on a blank page with no charset set, let the browser auto-detect the charset, and let the user confirm before submitting to re-encode it as UTF-8. Then let people fix the stuff they think is important. I have a prototype of this form submission on landfill somewhere at the moment; I was playing with that idea at one point.
Ick. Can we tell (using JS) what charset the browser has auto-detected? Or would people need to look in Page Info and then choose from a list? How do we detect those items which aren't UTF-8? That is to say, how does the auto-conversion script detect which items to convert and which to flag for "manual" conversion? Is there any way to do this without (re-)writing Mozilla's code to do charset detection for small volumes of text? > Picking "one" to convert from won't even close to work. Depends what you mean by "close". Picking ISO-8859-15 would almost certainly do 99% of comments... Gerv
(In reply to comment #145) > Can we tell (using JS) what charset the browser has auto-detected? Or would > people need to look in Page Info and then choose from a list? document.characterSet > How do we detect those items which aren't UTF-8? That is to say, how does the > auto-conversion script detect which items to convert and which to flag for > "manual" conversion? (I guess it checks whether the item is a valid UTF-8 sequence?) > Depends what you mean by "close". Picking ISO-8859-15 would almost certainly do > 99% of comments... Well, for 99% of the comments, US-ASCII will suffice...
I'm also confused as to why you think ISO-8859-15 is more common than -8859-1. And 99% of comments still leaves many thousands of comments that will become largely unreadable, which is a regression from the current state (which involves the UA guessing at the encoding instead of being forced to use the wrong one). justdave's idea seems like the best so far.
I know that someone else has already suggested the following solution but I can't find it at the moment. I think that this would be the best trade-off solution for new/legacy databases: Add a utf8 flag to each bug. All new bugs automatically have the utf8 flag set. Any bugs which have the utf8 flag set have all of the appropriate HTTP/HTML headers to specify the charset encoding for the bug and form as UTF-8. All other bugs are displayed as currently (no charset). This sorts out the standard bugzilla installation and upgrades and is the only thing needed to be done officially by bugzilla (IMHO). Then, if someone wants to convert existing bugs to UTF-8, then this can be done via an 3rd party tool. e.g. someone creates a utility which can be run on a bugzilla database which (1) tests existing bugs for US-ASCII only text and if so sets the utf8 flag, (2) convert other charsets to UTF-8 from detected or specified charsets if desired.
Here's what the prototype script I have on landfill does... It runs a quick Perl regexp to see if there's any 8-bit data in it. If there isn't, then consider it to be ASCII (which is a subset of UTF-8, and thus safe). If there is 8-bit data, then we use Encode::decode_utf8() to test whether it's valid UTF-8 or not. This isn't 100% accurate, but it's probably better than 99%. This does mean a minimum requirement of Perl 5.8.0, however. <form method="post"> <input type="hidden" name="action" value="update-comment"> <input type="hidden" name="bug_id" value="[% bug_id %]"> Comment: <textarea name="comment">[% comment FILTER html %]</textarea><br> Charset: <input type="text" id="charset" name="charset" value="foo"><br> <script>document.getElementById("charset").value = document.characterSet;</script><br> <input type="submit" value="Update"> </form>
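The two checks the prototype performs (a regexp for 8-bit bytes, then a strict UTF-8 decode) translate directly. This is an illustrative Python rendering, not the actual landfill script:

```python
import re

def classify(raw):
    """Mirror the prototype's logic: pure 7-bit data is ASCII (a subset
    of UTF-8, so safe); 8-bit data that decodes cleanly is UTF-8; the
    rest needs manual re-encoding via the form."""
    if not re.search(rb"[\x80-\xff]", raw):
        return "ascii"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"

print(classify(b"plain comment"))           # ascii
print(classify("naïve".encode("utf-8")))    # utf-8
print(classify("naïve".encode("latin-1")))  # unknown
```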
> I'm also confused as to why you think ISO-8859-15 is more common than -8859-1. Because it's in practice a superset (although I doubt there's many Euro characters in Bugzilla, so the difference doesn't really matter.) Brodie: your idea might work if single bugs were the only thing Bugzilla displays, but it doesn't. It displays multiple bugs at once, and buglists, and... Gerv
iso-8859-15 is more than iso-8859-1 + euro. There are other new characters in there that *are* used in western Europe (the characters they replace, OTOH, can safely be said to be *very* unusual - that's why they were nuked when ISO defined -15).
(In reply to comment #142) > There is no reason to "vote" to use UTF-8. It is pretty clear that that is the > preferred option of the bugzilla developers (just look at the title of the bug). My comment is, in fact, in reply to comment #139, which seems to mean we have to postpone the database upgrade because we still can't figure out how to do a *complete* conversion, and thus the database still remains without an encoding. Maybe I got it wrong, but my "vote" to use UTF-8 means that we have to do it as early as possible. The reason is simple: the later we do the conversion, the more we're going to lose. > While treating existing data as ISO-8859-1 or ISO-8859-15 > might work for most english or european installations, it would be incorrect for > a Japanese bugzilla installation. Sure, but there's no solution to this. We have to be determined or we'll never get out of it. As I've written before, it's better to lose some of what we got in the past than to also lose what we're going to get in the future. > Probably the right process is to check if each string is valid UTF-8. If it is, > leave it. If it isn't, convert it from the encoding the administrator thinks > the most of their content is in. This is infeasible. Taking bugzilla as an example, it has more than 200000 bugs. How could he check whether most of the content is in one encoding or another? You're not supposing he's going to read every bug, are you?
(In reply to comment #147)
> I'm also confused as to why you think ISO-8859-15 is more common than -8859-1.
>
> And 99% of comments still leaves many thousands of comments that will become
> largely unreadable, which is a regression from the current state (which involves
> the UA guessing at the encoding instead of being forced to use the wrong one).
>
> justdave's idea seems like the best so far.

That was my idea :)
(In reply to comment #148)
> Then, if someone wants to convert existing bugs to UTF-8, then this can be done
> via an 3rd party tool. e.g. someone creates a utility which can be run on a
> bugzilla database which (1) tests existing bugs for US-ASCII only text and if so
> sets the utf8 flag, (2) convert other charsets to UTF-8 from detected or
> specified charsets if desired.

An (existing) bug can have comments in different encodings (i18n-related bugs). The detection/conversion should be done with respect to every comment, not the whole bug.
What about Encode::Guess, for more excitement. We can't get 100%, but we can probably get close enough. What about the other issues I've raised, such as database searching support and forms on older browsers?
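For what it's worth, the per-string check proposed above ("check if each string is valid UTF-8; if not, convert from the encoding the administrator thinks the data is in") is short to sketch with Perl 5.8's core Encode module. This is a hypothetical helper for illustration only, not code from any of the patches in this bug (and the thread notes elsewhere that Bugzilla still targets perl 5.6, which has no Encode):

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Return the string re-encoded as UTF-8 octets: leave it alone if it
# already decodes as valid UTF-8, otherwise convert it from the
# fallback charset the administrator specified.
sub to_utf8 {
    my ($octets, $fallback) = @_;
    my $ok = eval { decode('UTF-8', $octets, FB_CROAK); 1 };
    return $octets if $ok;    # already valid UTF-8
    return encode('UTF-8', decode($fallback, $octets));
}

print to_utf8("caf\xc3\xa9", 'iso-8859-1'), "\n";  # valid UTF-8, unchanged
print to_utf8("caf\xe9",     'iso-8859-1'), "\n";  # Latin-1 byte, converted
```

Encode::Guess could replace the fixed fallback, but guessing between 8-bit encodings is inherently unreliable, which is why the admin-specified fallback is the usual approach.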
Re: ISO-8859-1 and ISO-8859-15

Windows-1252 is a proper superset of ISO-8859-1 and more commonly emitted by browsers than ISO-8859-1. It is the safest guess of the three.
"commonly emitted by browsers than ISO-8859-15" I meant
While we debate what to do with existing installations of bugzilla, can this please be implemented so that all *new* installations will at least be UTF-8? Then there will not be ever-increasing numbers of users who need to go through the pain of this.
> Windows-1252 is a proper superset of ISO-8859-1

How is this true? Aren't both 8-bit encodings with valid glyphs for all octets?

Gerv
(In reply to comment #159)
> > Windows-1252 is a proper superset of ISO-8859-1
>
> How is this true? Aren't both 8-bit encodings with valid glyphs for all octets?

The C1 block of ISO-8859-1 (0x80 - 0x9f) doesn't have any graphic characters, as its name implies. It's for control characters (usually with no visible representation). Nonetheless, they're characters, so strictly speaking ISO-8859-1 is NOT a proper subset of Windows-1252. However, practically, it can be thought of that way. Anyway, that's quite off-topic here. We should do something about this bug soon, and especially, I agree with comment #158.
(In reply to comment #159)
> > Windows-1252 is a proper superset of ISO-8859-1
>
> How is this true? Aren't both 8-bit encodings with valid glyphs for all octets?

Windows-1252 includes printable characters in the C1 block that ISO-8859-(1,15) does not.
(In reply to comment #159)
> > Windows-1252 is a proper superset of ISO-8859-1
>
> How is this true?

It isn't.

> Aren't both 8-bit encodings with valid glyphs for all octets?

They aren't. For example 0x81 isn't defined in Windows-1252.
I knowingly ignored control characters that are of no practical interest here. As far as *printable* characters are concerned, Windows-1252 is a proper superset of ISO-8859-1. For user-entered bug data it makes sense to consider characters in the 0x80-0x9f range as printable characters from the Windows counterpart of a particular ISO encoding.
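The point above is easy to demonstrate: the same byte in the 0x80-0x9f range decodes to a printable character under Windows-1252 but to a C1 control character under ISO-8859-1. A small illustration (assuming Perl 5.8's core Encode module, which this bug's codebase itself does not yet require):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The byte 0x93 is a left curly quote in Windows-1252 but an
# invisible C1 control character in ISO-8859-1.
my $byte = "\x93";
printf "cp1252:     U+%04X\n", ord decode('cp1252',     $byte);  # U+201C
printf "iso-8859-1: U+%04X\n", ord decode('iso-8859-1', $byte);  # U+0093
```

This is why treating legacy ISO-8859-1 data as Windows-1252 during a conversion loses nothing in practice: every byte that was printable stays printable, and the C1 bytes were never meaningful user text anyway.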
I feel it would be useful to split this bug into two - one bug for implementing the optional charset headers and one for making the data of existing databases match it.

The implementation as I see it is optionally sending a user-defined character encoding for all pages, comments, emails, etc. This would be set to UTF-8 by default in new installs and turned off by default in upgrades (no header = same behaviour as now). This problem seems to be quite well defined and implementable now, as evidenced by the patches.

The other bug, for the problem of upgrading existing installs to use that character set, has already had a lot of discussion and ultimately just needs someone to try implementing a solution to nail down the real problems.
*** Bug 256665 has been marked as a duplicate of this bug. ***
Attached patch updated patch (obsolete) (deleted) — Splinter Review
ok, here's a different patch:
- updated against the tip
- uses CGI's charset() method, which simplifies things a lot
- charset is always utf-8, so the param is now a boolean
- param defaults to false (for existing installs), with checksetup setting it to true for new installs

todo:
- encode email headers as per rfc 2047
- encode email body as base64 (depending on percentage of non-8-bit chars)
- set accept-encoding attribute on all forms (template function?)
- add test to ensure accept-encoding attribute is always present

for a new bug (imho):
- migration tools for existing installs
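The first todo item, RFC 2047 header encoding, could look roughly like this. A simplified sketch using the core MIME::QuotedPrint module (the `encode_header` name is hypothetical, and real Q-encoding must additionally escape `?`, `_` and fold long headers, which this skips):

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(encode_qp);

# Wrap a UTF-8 header value in an RFC 2047 "encoded-word", but only
# when it actually contains characters outside printable ASCII.
sub encode_header {
    my ($value) = @_;
    return $value unless $value =~ /[^\x20-\x7E]/;
    my $qp = encode_qp($value, '');  # '' = no soft line breaks appended
    $qp =~ s/ /_/g;                  # RFC 2047 Q-encoding uses '_' for space
    return "=?UTF-8?Q?$qp?=";
}

print encode_header('Plain ASCII subject'), "\n";
print encode_header("Fr\xc3\xa9d\xc3\xa9ric"), "\n";  # =?UTF-8?Q?Fr=C3=A9d=C3=A9ric?=
```

The "only if needed" guard matters: ASCII-only headers should pass through untouched so existing mail filters keep matching on them.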
How about splitting the email bits out into a separate bug to make it easier to get traction on this one?
Attached patch utf-8 patch with initial email support (obsolete) (deleted) — Splinter Review
here's *preliminary* email encoding, with the following issues:
- not applied to all instances of sendmail
- need to move the call to Param
- doesn't encode to/from/reply-to
- needs loads of testing :)

but it's a start.
Attachment #70148 - Attachment is obsolete: true
Attachment #70262 - Attachment is obsolete: true
Attachment #70270 - Attachment is obsolete: true
Attachment #70271 - Attachment is obsolete: true
Attachment #70410 - Attachment is obsolete: true
Attachment #70804 - Attachment is obsolete: true
Attachment #71202 - Attachment is obsolete: true
Attachment #71859 - Attachment is obsolete: true
Attachment #72246 - Attachment is obsolete: true
Attachment #72387 - Attachment is obsolete: true
Attachment #72860 - Attachment is obsolete: true
Attachment #88599 - Attachment is obsolete: true
Attachment #158736 - Attachment is obsolete: true
I have to change the charset to UTF-8 every time I want to read a bug report from bugzilla.mozilla.org which contains German umlauts and similar characters, or cyrillic chars, because no charset is specified and the default is ISO-8859-1 (which is almost right for other pages). Only with UTF-8 are the reports displayed correctly. So it seems to me that the bugzilla.mozilla.org data ARE in UTF-8 already.
Regarding comment #169: The problem is worse than that. It's not just a display problem, but an encoding problem: when using Mozilla (not MS-IE), form submissions will also not be in UTF-8. Thus the characters are actually stored with the wrong encoding in bugzilla. This applies to attachments as well. In my bugzilla I have bug reports with mixed encodings. I think we don't want mixed encodings (one encoding per comment or attachment or input field), right?
No longer blocks: 266658
(In reply to comment #169)
> I have to everytime change the charset to UTF-8 if I want read a bug report from
> [deleted]
> displayed correct. So for me it seems that the bugzilla.mozilla.org data ARE in
> UTF-8 already.

I'm not so sure. Maybe those reports were made by people whose browsers' encoding happened to be UTF-8? I've just made a "bug", actually a test-bed, in bug 266658 in order to avoid polluting this bug, which is mainly for discussion. Everybody, when you want to test in that bug, please write as much info as possible, like the encoding of your browser when you write the test, and in what language you're writing.
There's nothing to test. The problem is well known. Bugzilla comments are in various encodings - UTF-8, ISO-8859-1, GB2312, Shift_JIS, KOI8-R, ISO-8859-2, EUC-KR, Windows-1251, etc. - because mozilla.org's bugzilla does NOT specify its encoding in the HTTP header or an html meta tag. What encoding is used in a particular comment is determined by the encoding selected in View | Encoding at the time of posting. To reduce the work required when it's finally decided to move to UTF-8, everybody IS strongly encouraged to set View | Encoding to UTF-8 before posting any non-ASCII comments. In the case of an attachment, you can explicitly specify the encoding like 'text/html; charset=XXX' or 'text/plain; charset=YYY', so there's nothing to worry about.
*** Bug 44343 has been marked as a duplicate of this bug. ***
I am concerned about these patches forcing new installs of bugzilla into UTF-8. I need to be able to file bugs containing pound signs and euro symbols, which are only available in ISO-8859-15. If a parameter is added, it should be to specify a charset for the installation, rather than a boolean switch between "UTF-8" or "let each bug be different".

Also, how does this affect the ctype=xml query string argument? At present, if I include a pound sign in a bug report, the page generates invalid XML, as there is no ISO-8859-15 encoding declaration on the XML, so most parsers default to UTF-8 and see the file as invalid (firefox displays the document with a ? graphic where the pound sign is, IE a parse error).

Finally, there is the importxml.pl script. In version 2.9.1, if I supply it an XML file in UTF-8 that contains a pound sign, the expat XML parser correctly rejects the file as invalid XML. If I supply it an XML file in ISO-8859-1, it accepts the file but enters the pound sign into the mysql database with an extra character preceding it (an A with an accent character above it). I'm assuming this is because importxml.pl is assuming the data to be UTF-8 and not paying attention to the actual directive in the file?
(In reply to comment #174)
> I am concerned about thes patches forcing new installs of bugzilla into UTF-8. I
> need to be able to file bugs containing pound sign and euro symbols which are
> only available in ISO-8859-15. If a parameter is added it should be to specify a
> charset for the installation rather than a boolean switch between "UTF-8" or
> "let each bug be different".

Please inform yourself of the things you are talking about before posting in a bug report. UTF-8 supports _all_ characters of ISO-8859-15 and many more; everything above code point 127 just happens to be encoded in two or more bytes rather than one.

Currently we have no encoding whatsoever defined in bugzilla; that means basically every character above code point 127 that you use in a bug report is at best illegal, at worst undefined crap. UTF-8 gives the possibility to support most characters used in any country without having to support setting probably-differing charsets on all bugs. That means that basically no non-ascii characters can become illegal by moving to UTF-8, as they are illegal and unsupported now, but almost all possible ones will become legal when switching to UTF-8.
*** Bug 275377 has been marked as a duplicate of this bug. ***
glob - You might as well request review, so that we can get some action on this bug. At least we can comment on what we think might be wrong with it.
Assignee: justdave → bugzilla
I don't know where this bug is going, but let me repeat one of the two points from the initial report: "Presently the bugzilla webpages don't contain an encoding header."

Doing a "grep header *.pl *.cgi", I see many occurrences of "print $cgi->header();". There's nothing wrong with that, except that the CGI.pm documentation says: "The -charset parameter can be used to control the character set sent to the browser. If not provided, defaults to ISO-8859-1."

Currently Firefox 1.0 (just to name one example) still thinks the pages are ISO-8859-1, and I have to switch the charset to UTF-8 for every page I visit. Someone stated that bugzilla uses UTF-8 internally, and exclusively. But please: why don't you tell the browser?

If this sounds easy to fix, would you please change those "$cgi->header()" to "$cgi->header('-charset' => 'utf-8')" in the near future? Did I miss something important? Sorry if I sound a bit impatient after almost two years...
Perhaps it would help if you read the bug. Especially the parts about legacy content and such.
(In reply to comment #179)
> Perhaps it would help if you read the bug. Especially the parts about legacy
> content and such.

If you are starting with a new and empty bugzilla and (just for example) the German translation, all the "Umlaute" are mis-displayed. This has nothing to do with legacy content. The longer you wait to decide on the right character encoding, the more "legacy content" you will get.

To make things worse (or maybe better), Microsoft's Internet Explorer sets the page encoding (automatically?) to UTF-8, so even when I enter the same words on the same machine with two different browsers, they will use two different character encodings. This has nothing to do with legacy content. Content is already mis-displayed _now_.

Despite that, you could make the character encoding a configurable parameter, so those who think that everything is OK with ISO-8859-1 can leave everything as it is now.

To summarize: it is a bug to use UTF-8 character coding in the HTML while saying the page is encoded as ISO-8859-1 in the HTTP header.
Blocks: bz-recode
Byron Jones: if your patch is ready for review then please request it. Try gerv and justdave, as with the previously attempted patches. Also created bug 280633 for upgrading existing installations, so we can stop wasting time in this bug on that subject, and changed the summary appropriately.
Summary: Use UTF-8 (Unicode) charset encoding for pages and email → Use UTF-8 (Unicode) charset encoding for pages and email for new installations
Comment on attachment 158835 [details] [diff] [review]
utf-8 patch with initial email support

This is definitely a really good start. It's bitrotted now, though, and getting it to affect all callouts to sendmail should be a piece of cake now that it's all in one place anyway :)

Also of note is the TODO item for properly handling encoding of email addresses (only encoding the real-name part and not the address itself), which probably should be done before this goes in, as it'll break things for people that have non-ascii chars in their email address.

glob: any chance of an update?
Attachment #158835 - Flags: review-
(In reply to comment #182)
> (From update of attachment 158835 [details] [diff] [review] [edit])
> This is definitely a really good start.

Great to hear that!

> (only encoding the real-name part and not the address itself), which probably
> should be done before this goes in as it'll break things for people that have
> non-ascii chars in their email address.

Non-ascii chars in their email address? Do you mean IDN in the domain-name part of an email address? Internationalized domain names (I don't think there are many bugzilla account holders with IDNs in their addresses) had better be converted to punycode.
> You meant IDN in the domain-name part of an email address?

no, as part of the real name. eg.

  From: =?UTF-8?B?RnLDqWTDqXJpYyBCdWNsaW4=?= <lpsolit@gmail.com>

i think we should ignore IDNs for now.

i've started working on getting this patch up to date and in a more workable state.
(In reply to comment #184)
> > You meant IDN in the domain-name part of an email address?
>
> no, as part of the real name.
> eg.
> From: =?UTF-8?B?RnLDqWTDqXJpYyBCdWNsaW4=?= <lpsolit@gmail.com>

So, the current patch RFC-2047-encodes the whole thing? If so, that definitely needs to be fixed before landing it.

> i think we should ignore IDNs for now.

Yeah, that's all right. I've never seen any bugzilla account holder with an IDN in her email address.
(In reply to comment #184)
> eg.
> From: =?UTF-8?B?RnLDqWTDqXJpYyBCdWNsaW4=?= <lpsolit@gmail.com>

Except make sure that you use quoted-printable (?UTF-8?Q?) instead of Base64 (?UTF-8?B?), if you do encode the header (the average mail client deals better with quoted-printable than Base64). If we required perl 5.8, we could use Encode (which is what I use at my installation in the Czech Republic).

However, the internationalization of email header encoding is actually a different bug: bug 110692.
> So, the current patch RFC-2047-encodes the whole thing? If so, that
> definitely needs to be fixed before landing it.

no, the current patch doesn't do any encoding of email addresses. that from line was one i picked at random from my mailbox.

> However, the internationalization of email header encoding is actually a
> different bug: bug 110692.

the current patch already encodes the subject, using UTF-8/quoted-printable.
Attached patch utf-8 v3 (obsolete) (deleted) — Splinter Review
updated utf-8 patch. this patch:
- adds a boolean utf-8 parameter, which is enabled by default on new installs, and disabled by default on existing installs

if utf-8 is enabled:
- page's charset is set to utf-8
- encoding attribute added to xml pages
- all emails are encoded:
  - email charset is set to utf-8
  - subjects are utf-8 quoted-printable'd if required
  - name component of email addresses is qp'ed if required
  - the body is encoded as:
    - quoted-printable if less than 50% of characters require encoding
    - base64 otherwise
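The body-encoding decision described in the last bullet can be sketched as a standalone function. This is a simplified illustration of the rule, not the patch code itself (the `encode_body` name is hypothetical):

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(encode_qp);
use MIME::Base64 qw(encode_base64);

# Pick quoted-printable when more than half the body is already 7-bit
# clean (keeps the raw mail mostly human-readable); otherwise base64
# is more compact. Returns the Content-Transfer-Encoding value and
# the encoded body.
sub encode_body {
    my ($body) = @_;
    my $clean = ($body =~ tr/\x20-\x7E\x0A\x0D//);  # count 7-bit chars
    if ($clean > length($body) / 2) {
        return ('quoted-printable', encode_qp($body));
    }
    return ('base64', encode_base64($body));
}

my ($cte, $encoded) = encode_body("mostly ASCII, one caf\xc3\xa9\n");
print "$cte\n";   # quoted-printable
```

The 50% threshold is a readability/size trade-off rather than anything mandated by the MIME RFCs; either encoding is valid for any body.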
Attachment #116623 - Attachment is obsolete: true
Attachment #158835 - Attachment is obsolete: true
Attachment #173337 - Flags: review?
if that patch goes in, please do file a bug to switch the mail to multipart so that the changes at the top can be qp even if the comments are base64.
I understand that this bug is not supposed to deal with legacy Bugzillas - that's cool. However, to make it easier to fix that problem, could we do the following? Instead of having a "utf8" boolean, have a "charset" parameter, which is blank by default and set to "utf8" on new installs. The behaviour is the same - it just means that when we come to fix legacy installs, we have a mechanism whereby users can choose a different charset which best matches their legacy data, without rearranging the prefs again. Gerv
(In reply to comment #190) I think that's a good idea. It helps localized installations, too -- localized templates often have *some* character encoding.
re: character encodings

We should be pushing UTF-8 as the only supported character encoding. This simplifies everything - it allows all languages supported by Unicode to be stored and displayed in bugzilla without changes. The same comment can be any combination of languages. Supported databases only need to support UTF-8, not a list of legacy encodings. If an existing legacy bugzilla installation is in a single character set, then migrating it to UTF-8 is not difficult.

re: templates

All templates should be in UTF-8. There is no reason to use legacy encodings, but many reasons we shouldn't, e.g. the reasons above, and easy replacement of UI languages. Think about a bug database where you wish to have multiple UI languages for the same database. Japanese, Korean and English speakers can all read and write the Japanese bug reports but prefer to have their own language for the UI. If legacy encodings are used for the UI, then it is not possible to enter the Japanese bugs when using a Korean UI. (This isn't a theoretical problem; it was the situation I had at my last company.)
Brodie said everything I was about to say, which is what a lot of web-based product developers must know (but they don't as can be seen in hotmail, yahoo mail, etc).
OK, I guess the real question is: is there some technically feasible solution to upgrading older Bugzillas to UTF-8, assuming the admin can tell us what charset it's in now? Do the relevant Perl modules exist? Do we have any clue which data would need massaging? We don't have to _implement_ it (yet), just determine whether or not it exists. If it doesn't exist, then the UI should have the ability to specify an alternate charset, no matter how hard we push UTF-8 in the docs and defaults. If it does exist, then we can have a "UTF-8 or nothing" switch like now. Gerv
(In reply to comment #194)
> OK, I guess the real question is: is there some technically feasible solution to
> upgrading older Bugzillas to UTF-8, assuming the admin can tell us what charset
> it's in now? Do the relevant Perl modules exist?

As you know well, we can't make that assumption about bugzilla.mozilla.org, because even in a single bug, multiple different encodings are used at bugzilla.mozilla.org. (We should be able to come up with a few things to do for the migration of bugzilla.mozilla.org, though.[1]) However, if there's such a legacy installation with a single encoding used throughout, one can use the 'Encode' module. (http://search.cpan.org/~dankogai/Encode-2.09/Encode.pm)

> Do we have any clue which data would need massaging?

Any textual data beyond ASCII (needless to say, we shouldn't touch attachments, even if they're text/*; and no sane person would have used 7bit encodings like ISO-2022-JP or HZ for bugzilla).

[1] Some of them are:
1) Send emails to those with their names stored in ISO-8859-1 (I've never seen anyone use non-ASCII characters in encodings other than ISO-8859-1 for their names at bugzilla.mozilla.org) asking them to update their account info in UTF-8.
2) Begin to emit 'charset=UTF-8' for bugs filed after a certain day (say, 2005-03-01). Do the same for current bugs with ASCII characters alone in their comments and title.
3) For existing bugs, add a very prominent warning right above the 'additional comments' text area that 'View | Character Encoding' should be set to UTF-8 if one's comment includes non-ASCII characters.
4) If we really want to migrate all existing bugs to UTF-8, add a button to each comment to indicate the current character encoding. If necessary, this button can be made available only to a select group of people knowledgeable enough to identify encodings reliably.
5) search/query may need some more tinkering...
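For the single-encoding case mentioned above, the Encode-based conversion is one call per string. A minimal sketch assuming Perl 5.8's core Encode module (the sample string is illustrative; a real migration would loop over the relevant database columns):

```perl
use strict;
use warnings;
use Encode qw(from_to);

# "Straße" stored as ISO-8859-1 octets; 0xdf is the sharp s.
my $comment = "Stra\xdfe";

# from_to() recodes the scalar in place, octets to octets.
from_to($comment, 'iso-8859-1', 'UTF-8');

# The single 0xdf byte is now the two-byte sequence 0xc3 0x9f.
printf "%v02X\n", $comment;   # 53.74.72.61.C3.9F.65
```

This only works when the admin's charset guess is right for the whole database, which is exactly why the mixed-encoding bugzilla.mozilla.org case needs the per-comment handling discussed above.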
There are modules in Perl to detect and recode character sets; technically this is not a problem. Dave Miller has some comments on using the browser to detect the charset and then recoding using Perl in bug 280633. He also seems to have cobbled together a proof of concept of this.

Remember also that this is implemented as an option. New databases get it by default. Old databases in legacy encodings don't have to convert their database to use newer versions of bugzilla. They can just leave the utf8 flag set to false and everything continues as normal. Of course, they will probably get better results and a better user experience by recoding to utf8; thus we have bug 280633. Possible migration features such as those that Jungshik mentioned should be discussed there.
comments on patch 'utf-8 v3':

* desc => 'Use UniCode (UTF-8 character set)',

Unicode, not UniCode. As a better description, perhaps...

'Use UTF-8 (Unicode) encoding for all text in Bugzilla. New installations should set this to true to avoid character encoding problems. Existing databases should set this to true only after the data has been converted from existing legacy character encodings to UTF-8 (see bug 280633).'

Other than that I don't know perl enough to review properly.
A while back someone proposed on IRC that we turned on UTF-8 for every bug ID that was greater than a certain number. Although that isn't a perfect solution (still mixed character encoding in the database) it would at least make all new bugs and all comments on new bugs forward compatible.
Great idea, but it is a new feature so either a new bug or discuss it on bug 280633. Let's keep discussion here to just support in new installations so that we at least get that functionality ASAP.
Summary: Use UTF-8 (Unicode) charset encoding for pages and email for new installations → Use UTF-8 (Unicode) charset encoding for pages and email for NEW installations
(In reply to comment #198)
> A while back someone proposed on IRC that we turned on UTF-8 for every bug ID
> that was greater than a certain number.

That's not new :-) It was first proposed by Markus Kuhn in 2002(?) and I repeated it a couple of times here (e.g. see comment #195, point #2). Anyway, this will be my last comment here on the migration. If there's anything new to add, I'll add it in bug 280633.
Anne: that breaks things when you have content from multiple bugs on the same page, such as buglists or longlist.cgi output. I think a Bugzilla needs to be either UTF-8, or a specific charset, or "no charset" (undefined, as now). Having it as > 1 specific charset, or part as some charset and part as no charset sounds like a nightmare. Gerv
(In reply to comment #201)
> Anne: that breaks things when you have content from multiple bugs on the same
> page, such as buglists or longlist.cgi output.

So does the current solution of allowing any encoding. Sending pages that have content from multiple bugs as UTF-8, and sending all bugs with an ID greater than the current bug number as UTF-8, seems like a reasonable start to me. We'll stop accumulating content of unknown encoding in new bugs, and there will still be a way to view the content of the older bugs (by viewing the bug as its own page) if there's some content that isn't UTF-8.
Hmm. this is a 'vicious' cycle that I have to break. I'll add my response to comment #201 in bug 280633
Comment on attachment 173337 [details] [diff] [review]
utf-8 v3

Hey. I have only small comments, without doing some testing:

>+ if ($header !~ /[^\x20-\x7E\x0A\x0D]/ and $body !~ /[^\x20-\x7E\x0A\x0D]/) {

Yeah, make those regexes a function, like we talked about. :-)

>+ $head->mime_attr('content-type' => 'text/plain') unless defined $head->mime_attr('content-type');

*nit* Just break this line and indent the "unless" four spaces (for a total of 8 spaces).

>+ if (defined $subject && $subject =~ /[^\x20-\x7E\x0A\x0D]/) {

Another place where the function would be cool. It's probably a Bugzilla::Util function, really.

>+ foreach my $field (qw(from to cc reply-to)) {

Other possible fields are Sender, X-Envelope-To, Errors-To, X-BeenThere, and Return-Path. Usually, though, those don't have names in them.

>+ $value =~ s/[\r\n]+$//;

I think that any given header should only have one line-ending, right? Unless it's split across several lines, in which case you'd have to remove the line terminators, which I think are semicolons for wrapped headers. If it had more than one line ending, it would be the end of the headers.

>+ if ($name =~ /[^\x20-\x7E\x0A\x0D]/) {

Another good place for the function.

>+ push @addresses, '=?UTF-8?Q?' . encode_qp($name) . '?= <' . $addr->address . '>';

Names can have commas in them. Does this deal with that? Also, does this QP encode the entire name? That will make encoded emails a mess in my Evolution, since it generally doesn't like encoded names. :-( I could live with that, though. Also, the line is *slightly* too long, and needs to be split into two lines.

>+ $changed = 1;
>+ } else {
>+ push @addresses, $addr->format;

Why do we even call format on the address, if we haven't changed it? Couldn't we just output it as a raw string? (Or is there some other problem with that that I'm not aware of?)

>+ if ($body !~ /[^\x20-\x7E\x0A\x0D]/) {
>+ # body is 7-bit clean, don't encode

Just reverse the logic, instead of having an empty if.
>+ } else {
>+ # count number of 7-bit chars, and use quoted-printable if more
>+ # than half the message is 7-bit clean
>+ my $count = ($body =~ tr/\x20-\x7E\x0A\x0D//);
>+ if ($count > length($body) / 2) {
>+ $head->replace('Content-Transfer-Encoding', 'quoted-printable');
>+ $body = encode_qp($body);
>+ } else {
>+ $head->replace('Content-Transfer-Encoding', 'base64');
>+ $body = encode_base64($body);

I'd want to test this with a few common mail clients, to make sure that they can actually read Base64 bodies. I seem to recall that some can't, but they all support QP. I'm not sure about that, though.

>+ $self->charset(Param('utf8') ? 'UTF-8' : '');

I think that usually the charset is lowercase, in HTTP headers.

>-<?xml version="1.0" standalone="yes"?>
>+<?xml version="1.0" [% IF Param('utf8') %]encoding="UTF-8" [% END %]standalone="yes" ?>

And the same, here. Although here I'm pretty sure it doesn't matter.
Attachment #173337 - Flags: review? → review-
Keywords: relnote
i'll do an updated patch when i have the chance; however i can answer a few of your queries now:

> >+ $value =~ s/[\r\n]+$//;
>
> I think that any given header should only have one line-ending, right?

on windows, get() was returning the fields with CRLF, so chomp wasn't stripping them. looking at the code, maybe i don't need to do that anymore. i'll have a play.

> >+ push @addresses, '=?UTF-8?Q?' . encode_qp($name) . '?= <' . $addr->address . '>';
>
> Names can have commas in them. Does this deal with that?

yes. Mail::Address->parse() returns an array of addresses, splitting in the correct location. however i just realised that Mail::Address->name() flips the order of comma-separated names. ie. "jones, byron" becomes "byron jones". i should be using phrase().

> Also, does this QP encode the entire name?

no, only characters that require QP'ing.

> >+ $changed = 1;
> >+ } else {
> >+ push @addresses, $addr->format;
>
> Why do we even call format on the address, if we haven't changed it? Couldn't
> we just output it as a raw string? (Or is there some other problem with that
> that I'm not aware of?)

there may be more than one address in the field, so we can't use the raw string.

> >+ $self->charset(Param('utf8') ? 'UTF-8' : '');
>
> I think that usually the charset is lowercase, in HTTP headers.

ok, i'll make it lowercase.

> >+<?xml version="1.0" [% IF Param('utf8') %]encoding="UTF-8" [% END %]standalone="yes" ?>
>
> And the same, here. Although here I'm pretty sure it doesn't matter.

it's case insensitive, but the xml specs use uppercase, so that's what most people use.
Status: NEW → ASSIGNED
Attached patch utf-8 v4 (obsolete) (deleted) — Splinter Review
this version addresses issues raised. notes:

- in the parameter description i didn't want to include the bug number, as that's more for the documentation, and it would be confusing if the local bugzilla install had a bug number 280633

- i've set MIME::Parser to not use temp files. while the MIME::Parser docs indicate there's a performance hit, as we only parse the header, the temporary objects are always empty.

- i've added "sender" and "errors-to" to the list of fields to encode email addresses on. the other two x- headers are added by mail servers, so there's no reason to check them here.
Attachment #173337 - Attachment is obsolete: true
Attachment #173713 - Flags: review?
Would there be a performance gain to check for 7-bit clean outside calling the encode_message function? Would that save copying the (header, body) pair a number of times?

e.g.

  ($header, $body) = encode_message($header, $body) if Param('utf8');

becomes

  # make sure there's work to be done
  if (Param('utf8')
      and (!is_7bit_clean($header) or !is_7bit_clean($body))) {
      ($header, $body) = encode_message($header, $body);
  }
The full name of the administrator user created by checksetup needs to be converted to UTF-8 if this name contains non-ASCII chars.
Blocks: 280905
(In reply to comment #208)
> The full name of the administrator user created by checksetup need to be
> converted to UTF-8 if this name contains non ASCII chars.

that's tricky, as i can't tell what charset the console is running in. how about i update checksetup to only allow 7-bit clean characters in the admin name, with a comment saying that once bugzilla is running, the name can be updated via the webpages?
(In reply to comment #209)
> how about i update checksetup to only allow 7-bit clean characters in the admin
> name, with a comment saying that once bugzilla is running the name can be
> updated via the webpages?

I think that's an acceptable solution. Just put the is_7bit_clean function in Bugzilla::Util, and don't "require" Bugzilla::Util until you need it. (Don't "use" it -- that will break checksetup. But you probably know that. :-))

Of course, I think you can pull out the "locale" information from the console, somehow. You could preserve that environment variable, the same way that we currently preserve $ENV{'PATH'}. I'm not sure it would work on Win32, though.
why not use perl's utf8 support? http://search.cpan.org/dist/perl/lib/utf8.pm
(In reply to comment #209)
> that's tricky as i can't tell what charset the console is running in.

nl_langinfo(CODESET) in C; surely perl has something like that too?
(In reply to comment #211)
> http://search.cpan.org/dist/perl/lib/utf8.pm

That's a pragma to enable/disable utf8 support in the perl source code, and to convert a standard perl string to a perl string with utf8 encoding. It doesn't recode anything. (Read the page.) So it doesn't really have any use here. We're encoding in quoted-printable; that has nothing to do with the above link. Finally, Encode.pm support (the built-in perl charset converter) is only available in perl 5.8, and we require perl 5.6.
(In reply to comment #212)
> > that's tricky as i can't tell what charset the console is running in.
> nl_langinfo(CODESET) in C; surely perl has something like that too?

even if we could detect the charset, i'd have to worry about conversion from the detected charset to utf8. sure, there are modules that'll help, but it'd be a lot of work for something with a trivial workaround.
Attachment #173713 - Flags: review?
Attached patch utf-8 v5 (obsolete) (deleted) — Splinter Review
adds the administrator name checking to checksetup, and the optimisation suggested by brodie.
Attachment #173713 - Attachment is obsolete: true
Attachment #174022 - Flags: review?
Note that this patch doesn't apply cleanly to BugMail.pm due to one line changed in bug 280973. Sorry for not providing an updated patch, but it's a bit hard to do on the setup I'm on right now.
Attached patch utf-8 v6 (obsolete) (deleted) — Splinter Review
fixed bitrot; thanks Håvard
Attachment #174022 - Attachment is obsolete: true
Attachment #174022 - Flags: review?
Attachment #174458 - Flags: review?
This patch does not apply cleanly for me against yesterday's CVS; there are problems in checksetup.pl and BugMail.pm. They seem to be simple bitrot issues, but I can't get at CVS from here to fix them properly at the moment.
Attached patch utf-8 v7 (obsolete) (deleted) — Splinter Review
bitrot fixes
Attachment #174458 - Attachment is obsolete: true
Attachment #176018 - Flags: review?
Attachment #174458 - Flags: review?
*** Bug 285255 has been marked as a duplicate of this bug. ***
*** Bug 279589 has been marked as a duplicate of this bug. ***
Blocks: 279589
No longer blocks: 279589
Flags: blocking2.20?
Hmm, this is damn close to ready to go, I don't want to lose it.
Flags: blocking2.20? → blocking2.20+
Comment on attachment 176018 [details] [diff] [review]
utf-8 v7

>Index: checksetup.pl

>+        # As it's a new install, enable UTF-8
>+        SetParam('utf8', 1);

I'm not sure if the new admin check is the best place to check this. Lots of folks upgrading from 2.16 or earlier are going to get nailed with this dialog even when it's not a new install because they twiddled with the bits on their admin account. We should try to find some other way to ensure that it's a new install.

>Index: Bugzilla/CGI.pm

>+    $self->charset(Param('utf8') ? 'utf-8' : '');

Nit: 'UTF-8' should be all uppercase in the header, to follow RFC 3629 section 8. (technically the field isn't case-sensitive, but since it's defined that way in the RFC we should follow it)

Rest of this looks good to me. Find a better way to detect a new install and this has an r+ from me.
Attachment #176018 - Flags: review? → review-
Attached patch utf-8 v8 (obsolete) (deleted) — Splinter Review
improved "new install" detection -- if we have to create data/nomail, it's a new install.
Attachment #176018 - Attachment is obsolete: true
Attachment #177578 - Flags: review?
Comment on attachment 177578 [details] [diff] [review]
utf-8 v8

ok, all code style and architecture nits addressed, actually tried testing it now...

The summary encoding is not happening correctly. I'm not sure what's wrong with it, but Eudora is showing decoded summaries with an extra = on the end, and Thunderbird is outright refusing to decode them.

Subject: =?UTF-8?Q?[Bug 579] This is a s=C3=BCmm=C3=A1ry= ?=
Attachment #177578 - Flags: review? → review-
We need to require MIME::Base64 v3.03 also. MIME::Tools doesn't explicitly prereq it, but it won't install due to test failures if you don't have at least that version. It'll mean fewer tech support problems for us if we just outright require it, saving people from hitting the install errors on MIME::Tools.
Attached patch utf-8 v9 (obsolete) (deleted) — Splinter Review
fixes subject encoding

requires MIME::Base64 (version 3.01 on windows, 3.03 on unix)
Attachment #177578 - Attachment is obsolete: true
Attachment #177580 - Flags: review?
Comment on attachment 177580 [details] [diff] [review]
utf-8 v9

woot!
Attachment #177580 - Flags: review? → review+
Attached patch utf-8 v10 (obsolete) (deleted) — Splinter Review
<justdave> hmmm.....
<justdave> actually, can we swap the order of MIME::Tools and MIME::Base64 in the modules list?
<justdave> MIME::Base64 is the prereq and since people tend to work top to bottom...
Attachment #177580 - Attachment is obsolete: true
Attachment #177581 - Flags: review?
Attachment #177581 - Flags: review? → review+
woot! woot!!
Flags: approval+
(In reply to comment #225)
> The summary encoding is not happening correctly. I'm not sure what's wrong
> with it, but Eudora is showing decoded summaries with an extra = on the end,
> and Thunderbird is outright refusing to decode them.
>
> Subject: =?UTF-8?Q?[Bug 579] This is a s=C3=BCmm=C3=A1ry= ?=

If my memory is correct, that form is not allowed to have spaces within it. All the words that are ASCII should be passed through outside the =?-escaped form, and all the words that are not ASCII should be escaped separately.

For the details, see ftp://ftp.rfc-editor.org/in-notes/rfc2047.txt (which I haven't really looked at while writing this comment; the above is from memory).
(or, alternatively, the spaces inside it could be escaped, but then you risk hitting the 75-character limit)
bah! so close! :) but he's right....
Flags: approval+
Attachment #177581 - Flags: review+ → review-
Do we not have more XML outputs than just show.xml.tmpl which need the encoding defined?

Surely the best test for a new install is if we are creating localconfig / creating the database?

Gerv
> Do we not have more XML outputs than just show.xml.tmpl which need the
> encoding defined?

ahhh, you're correct.

template/en/default/bug/show.xml.tmpl
template/en/default/config.rdf.tmpl
template/en/default/list/list.rdf.tmpl
template/en/default/list/list.rss.tmpl
template/en/default/reports/duplicates.rdf.tmpl

> Surely the best test for a new install is if we are creating localconfig

no, because when we create localconfig, data/params hasn't been created; it's created in the second phase of first-time checksetup.

> creating the database?

i normally manually create an empty database before kicking off checksetup, as i have to set the access permissions for the bugzilla account anyhow. so i have new installs with an existing, but empty, database.
> no, because when we create localconfig, data/params hasn't been created; it's
> created in the second phase of first-time checksetup.

Right then - so let's do it in the second phase, when we create data/params.

Gerv
i've hit a snag. even if the header is folded correctly, Mail::Mailer strips \n\s* from the lines, removing the folding, so lines can break rfc by exceeding the max length. grr
spaces inside =?Q? must be encoded as an underscore (_), see ftp://ftp.rfc-editor.org/in-notes/rfc2047.txt, 4.2(2)
Attached patch utf-8 v11 (obsolete) (deleted) — Splinter Review
ok, this is probably the best i can do without rewriting a whole lot of other modules.

this version encodes only the words in the subject that require encoding, rather than the whole line. this avoids any spaces issues, and makes the line easier to wrap.

the sub normally generates a header that is wrapped at 75 characters, using Mail::Header's folding code. however Mail::Mailer kindly strips \n's from the header, resulting in lines longer than 75 characters being sent. Mail::Header's folding code is very simple -- it breaks the line on whitespace only. thus even if Mail::Mailer didn't unfold the header lines, it's possible (but unlikely) that we'll still generate >75 character lines.

so, here's a solution that appears to work but is not rfc compliant. i've contacted the author of Mail::Mailer and Mail::Header, so another solution may be to wait for these issues to be fixed upstream.
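To make the per-word scheme described above concrete, here is a small sketch in Python (standing in for the Perl patch; quopri plays the role of MIME::QuotedPrint's encode_qp). It is deliberately simplified -- it does not fold the header and does not escape '?' inside words -- but it shows the core idea: only words that are not 7-bit clean become RFC 2047 encoded-words, so no space ever lands inside one.

```python
import quopri

# Per-word RFC 2047 "Q" encoding sketch: ASCII words pass through,
# non-ASCII words are individually wrapped as =?UTF-8?Q?...?=.
def encode_qp_words(line: str) -> str:
    out = []
    for word in line.split(" "):
        raw = word.encode("utf-8")
        if all(b < 0x80 for b in raw):
            out.append(word)  # 7-bit clean: leave untouched
        else:
            qp = quopri.encodestring(raw).decode("ascii")
            # RFC 2047 4.2: "_" stands for space inside "Q", so a
            # literal underscore must be escaped as =5F
            out.append("=?UTF-8?Q?%s?=" % qp.replace("_", "=5F"))
    return " ".join(out)

print(encode_qp_words("[Bug 579] This is a s\u00fcmm\u00e1ry"))
```

Because encoded-words never contain spaces here, folding code that breaks on whitespace can wrap the result without producing the broken subjects seen in comment #225.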
Attachment #177581 - Attachment is obsolete: true
Attachment #178110 - Flags: review?
Depends on: 287064
(In reply to comment #240)

I don't quite understand some parts:

+sub encode_qp_words($) {
+    my ($line) = (@_);
+
+    my $line = encode_qp($line, '');
+    $line =~ s/ /=20/g;

Shouldn't you replace SPC with '_'?

+    return "=?UTF-8?Q?$line?=";

Is this an unconditional return? Looks like it. Will the rest ever be considered?

+
+    my @encoded;
+    foreach my $word (split / /, $line) {

Are there any SPCs left?

+        if (!is_7bit_clean($word)) {
+            push @encoded, '=?UTF-8?Q?' . encode_qp($word, '') . '?=';
+        } else {
+            push @encoded, $word;
+        }
+    }
+    return join(' ', @encoded);
+}
Comment on attachment 178110 [details] [diff] [review]
utf-8 v11

> +    $line =~ s/ /=20/g;
>
> Shouldn't you replace SPC with '_'?

the rfc allows for _ or =20:

    The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
    represented as "_" (underscore, ASCII 95.)

> +    return "=?UTF-8?Q?$line?=";
>
> Is this an unconditional return? Looks like it.
> Will the rest ever be considered?

d'oh, those three lines are debug code and shouldn't be there. thanks for pointing that out.
Attachment #178110 - Attachment is obsolete: true
Attachment #178110 - Flags: review?
Attached patch utf-8 v12 (obsolete) (deleted) — Splinter Review
Mail::Mailer version 1.67 fixes the bugs that were stopping us from using it. This patch bumps up the minimum version, and addresses the other outstanding issues.
Attachment #179244 - Flags: review?
Blocks: 281522
Blocks: 287684
Blocks: 287682
note that it's still possible for us to generate emails with lines greater than 75 characters: if the subject doesn't contain any spaces, we don't have a point to wrap it at. i know how to fix this, but it's a fair amount of work, so i'd prefer for that to be covered in another bug.

note that the current bugzilla code can also generate >75 char lines, as there are no checks in place to stop this .. for example if the url is too long, eg "http://you-havent-visited-editparams.cgi-yet/userprefs.cgi", the "Configure bugmail" line in the message footer will be more than 75 characters.
(In reply to comment #244)
> note that it's still possible for us to generate emails with lines greater
> than 75 characters, if the subject doesn't contain any spaces we don't have
> a point to wrap it at.

Why would you consider this to be a problem? From RFC 2822:

   2.1.1. Line Length Limits

   There are two limits that this standard places on the number of
   characters in a line. Each line of characters MUST be no more than
   998 characters, and SHOULD be no more than 78 characters, excluding
   the CRLF.

So IMO, you are following the spirit of the RFC and are wrapping when possible; sometimes, as you pointed out, that is not possible. I would find the alternative of MIME-encoding subject lines that are >75 chars to be a much worse solution, as I would have to assume that the only mail agents still susceptible to being bitten by the 78-character recommended limit are extremely old and therefore wouldn't understand MIME encoding anyway.
(In reply to comment #245)
> > [...] generate emails with lines greater than 75 characters
> Why would you consider this to be a problem?
> From RFC 2822: [snip]

See RFC 2047, specifically (from its section 2):

   An 'encoded-word' may not be more than 75 characters long, including
   'charset', 'encoding', 'encoded-text', and delimiters. If it is
   desirable to encode more text than will fit in an 'encoded-word' of
   75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may
   be used.

   While there is no limit to the length of a multiple-line header
   field, each line of a header field that contains one or more
   'encoded-word's is limited to 76 characters.

The current patch would fail to meet this requirement if someone creates a summary with so many consecutive non-spaces that an 'encoded-word' longer than 75 characters is created (which mail programs etc. may not recognise). The solution (as described in the RFC) is to break the text into smaller chunks, creating multiple encoded-word entities each <= 75 characters, but this can just as well be done after this patch lands.
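The chunking the RFC describes can be sketched as follows -- a Python illustration under the assumption of UTF-8 "Q" encoding, not the patch's Perl code. The important detail is that the split happens on character boundaries, so no multibyte UTF-8 sequence is cut in half, and each resulting encoded-word stays within the 75-character limit:

```python
import quopri

PREFIX, SUFFIX, MAX_WORD = "=?UTF-8?Q?", "?=", 75

def qp(text: str) -> str:
    return quopri.encodestring(text.encode("utf-8")).decode("ascii")

# Split text into RFC 2047 encoded-words of at most 75 characters each,
# accumulating whole characters so no UTF-8 sequence straddles a chunk.
def encode_long(text: str) -> str:
    words, current = [], ""
    for ch in text:
        if len(PREFIX) + len(qp(current + ch)) + len(SUFFIX) > MAX_WORD and current:
            words.append(PREFIX + qp(current) + SUFFIX)
            current = ch
        else:
            current += ch
    if current:
        words.append(PREFIX + qp(current) + SUFFIX)
    # multiple encoded-words are separated by CRLF SPACE when folded
    return "\r\n ".join(words)
```

A long space-free Cyrillic summary, for instance, comes out as several 72-character encoded-words that any RFC 2047-aware mail reader can reassemble.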
"If it's not a regression from 2.18 and it's not a critical problem with something that's already landed, let's push it off." - Dave
Flags: blocking2.20+
Whiteboard: i18n → i18n [wanted for 2.20]
Flags: blocking2.20-
(In reply to comment #247)
> "If it's not a regression from 2.18 and it's not a critical problem with
> something that's already landed, let's push it off." - Dave

So continuing to fill bugzilla databases with unrecoverable undefined **** & having notification mails vanish out of existence because they're so badly formatted that any spam checker will mistake them for spam is not a critical problem?
(In reply to comment #248)
> (In reply to comment #247)
> > "If it's not a regression from 2.18 and it's not a critical problem with
> > something that's already landed, let's push it off." - Dave
>
> So continuing to fill bugzilla databases with unrecoverable undefined crap &
> having notification mails vanish out of existence because they're so badly
> formated any spam checker will mistake them for spam is not a critical
> problem ?

It's not a critical problem in something that has landed during this cycle; it has always been there. So it isn't a blocker for 2.20. I'm sure a complete set of patches would still be accepted, though. Perhaps you could do a review on the current set?
My Bugzilla version is 2.17.7. Where can I download the "utf-8 v12" files? And once I get the patch files and overwrite the same files, will the change take effect?
Who can help me?! Where can I get the patch files, and how do I use them? I am struggling with the garbled-character problem. My version is 2.17.7. Thanks all.
After some discussions on IRC, it's become apparent that this is potentially destabilizing, as there are uncertainties about how searching will be affected and what kind of problems we'll run into by not using Perl's utf-8 support. I'm perfectly willing to check this in and iron out said problems afterwards, but not while we're in a release freeze, and not on a stable branch. Pushing this off to 2.22. I'd really like to land this as soon as possible after we branch for 2.20, though.
Whiteboard: i18n [wanted for 2.20] → i18n
Target Milestone: Bugzilla 2.20 → Bugzilla 2.22
*** Bug 298243 has been marked as a duplicate of this bug. ***
OK, we've branched, and the trunk is open. Let's get this thing reviewed and landed! :)
Comment on attachment 179244 [details] [diff] [review]
utf-8 v12

Hit by bitrot, but trivial unrotting -- r=wurblzap on an unrotted patch.

In a follow-up bug, we need to find a way to stop substr() from splitting UTF-8 characters in half :/ Glitches in standards compliance should imho be handled in post-checkin fixes.

Tested on Windows, using smtp and testfile as mail_delivery_method. Couldn't get my hands on MIME-tools 5.417, but it works for me with 5.411a just as well. Works for newchangedmail, passwordmail, flag mail. Tested both quoted-printable and base64 encodings (forced base64 by turning the 8-bit-content check around). Tested 7-bit-only mails.

Let's do it :)
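The substr() worry is easy to demonstrate. Perl's substr on an unflagged (byte) string counts bytes, so a truncation boundary can land in the middle of a multibyte UTF-8 sequence. The Python sketch below shows the byte-level failure and the character-level fix -- the same effect as calling utf8::decode before substr in Perl:

```python
summary = "Fr\u00e9d\u00e9ric"        # "Frédéric": 8 characters, but each
raw = summary.encode("utf-8")         # é is 2 bytes, so 10 bytes of UTF-8

broken = raw[:3]                      # byte-level cut: b'Fr\xc3', half an é
print(broken.decode("utf-8", "replace"))   # renders as 'Fr' + U+FFFD

safe = summary[:3]                    # character-level cut: always valid
print(safe)
```

This is exactly the "Fr�d�ric" symptom reported later in this bug: a byte-oriented truncation leaves a dangling lead byte that displays as the replacement character.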
Attachment #179244 - Flags: review? → review+
*** Bug 175782 has been marked as a duplicate of this bug. ***
Flags: approval+
Attached patch utf-8 v12 unrotted (deleted) — Splinter Review
Unrotted the original patch so that it may be checked in. Fixing 011pod.t complaint, too.
Attachment #179244 - Attachment is obsolete: true
Attachment #191577 - Flags: review+
Checking in checksetup.pl;
/cvsroot/mozilla/webtools/bugzilla/checksetup.pl,v  <--  checksetup.pl
new revision: 1.420; previous revision: 1.419
done
Checking in defparams.pl;
/cvsroot/mozilla/webtools/bugzilla/defparams.pl,v  <--  defparams.pl
new revision: 1.163; previous revision: 1.162
done
Checking in Bugzilla/BugMail.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/BugMail.pm,v  <--  BugMail.pm
new revision: 1.42; previous revision: 1.41
done
Checking in Bugzilla/CGI.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/CGI.pm,v  <--  CGI.pm
new revision: 1.18; previous revision: 1.17
done
Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v  <--  Util.pm
new revision: 1.34; previous revision: 1.33
done
Checking in template/en/default/config.rdf.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/config.rdf.tmpl,v  <--  config.rdf.tmpl
new revision: 1.5; previous revision: 1.4
done
Checking in template/en/default/bug/show.xml.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/bug/show.xml.tmpl,v  <--  show.xml.tmpl
new revision: 1.8; previous revision: 1.7
done
Checking in template/en/default/list/list.rdf.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/list/list.rdf.tmpl,v  <--  list.rdf.tmpl
new revision: 1.5; previous revision: 1.4
done
Checking in template/en/default/list/list.rss.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/list/list.rss.tmpl,v  <--  list.rss.tmpl
new revision: 1.4; previous revision: 1.3
done
Checking in template/en/default/reports/duplicates.rdf.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/reports/duplicates.rdf.tmpl,v  <--  duplicates.rdf.tmpl
new revision: 1.2; previous revision: 1.1
done
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
FYI - if a site admin changed CGI.pm by hand to apply the UTF-8 security fixes, a CVS update will fail with a conflict. bugzilla/docs/html/security-bugzilla.html should be changed to reflect the fixes for this change.

--------------
<<<<<<< CGI.pm
    # Make sure that we don't send any charset headers
    $self->charset('UTF-8');
=======
    # Send appropriate charset
    $self->charset(Param('utf8') ? 'UTF-8' : '');
>>>>>>> 1.18
--------------

thx
tim
Flags: documentation?
Attached patch Documentation patch (obsolete) (deleted) — Splinter Review
Attachment #192066 - Flags: review?(documentation)
Comment on attachment 192066 [details] [diff] [review]
Documentation patch

>Index: docs/xml/security.xml

>-      incorporate by default the code changes suggested by
>+      <para>If you installed Bugzilla version 2.20 or later from scratch,

Wasn't this checked in on trunk-only - therefore it is Bugzilla version 2.22 or later?

>+      This is because due to internationalization concerns, we are unable to
>+      turn the <emphasis>utf8</emphasis> parameter on by default for upgraded
>+      installations.

This sentence doesn't read correctly to me...
Attachment #192066 - Flags: review?(documentation) → review-
(In reply to comment #261)
> Wasn't this checked in on trunk-only - therefore it is Bugzilla version 2.22
> or later?

True.

> This sentence doesn't read correctly to me...

Ok. I'm no native speaker -- please give me a good sentence, and I'll put it into a patch.
(In reply to comment #262)
> Ok. I'm no native speaker -- please give me a good sentence, and I'll put it
> into a patch.

Apparently it does make sense to others... so just fix the first bit :)
Attached patch Documentation patch 1.2 (deleted) — Splinter Review
Attachment #192066 - Attachment is obsolete: true
Attachment #195374 - Flags: review?(documentation)
Comment on attachment 195374 [details] [diff] [review]
Documentation patch 1.2

r=me by inspection....
Attachment #195374 - Flags: review?(documentation) → review+
Docs (attachment 195374 [details] [diff] [review]):

Checking in docs/xml/security.xml;
/cvsroot/mozilla/webtools/bugzilla/docs/xml/security.xml,v  <--  security.xml
new revision: 1.8; previous revision: 1.7
done
Flags: documentation?
*** Bug 318151 has been marked as a duplicate of this bug. ***
*note* test at http://landfill.bugzilla.org/bugzilla-tip :

If the name of a saved search contains UTF-8 characters, it displays wrong. "Frédéric" would display as "Fr�d�ric".

regards reinhardt [[user:gangleri]]
(In reply to comment #268)
> If the name of a saved search contains UTF-8 characters they display wrong.

Works for me.

If the saved search was created before the switch to UTF-8, then yes, this is possible, but not a bug -- conversion is handled by bug 280633. If you get broken characters with a newly saved search, then please file a new bug.
(In reply to comment #269)
> Works for me.
>
> If the saved search was created before the switch to UTF-8, then yes, this is
> possible, but not a bug -- conversion is handled by bug 280633. If you get
> broken characters with a newly saved search, then please file a new bug.

Opened bug 318583 at landfill.bugzilla.org: if a search is saved with a name containing UTF-8 characters, this name is not shown properly on any page containing "Saved Searches:".
*** Bug 316836 has been marked as a duplicate of this bug. ***
*** Bug 319343 has been marked as a duplicate of this bug. ***
Why can't bugzilla also use a META tag for content-type? How is doing that broken?
Added to the Bugzilla 2.22 Release Notes in bug 322960.
Keywords: relnote
There are some issues when I use Bugzilla with UTF-8.

With a MySQL db:

1) The MySQL connection should be UTF-8; this is enabled by this patch:

Index: Bugzilla/DB/Mysql.pm
===================================================================
RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v
retrieving revision 1.36
diff -r1.36 Mysql.pm
70a71,76
>     $self->do ("set session character_set_results=utf8");
>     $self->do ("set session character_set_client=utf8");
>     $self->do ("set session character_set_connection=utf8");
>     $self->do ("set session character_set_database=utf8");
>     $self->do ("set session character_set_server=utf8");
>

2) The Summary field in search results (aka the bug list) gets trimmed right between the bytes of a Unicode character, and the length of a Summary is about 30 characters for Russian, not 60. Before the "..." you can see a bad symbol. See patch:

Index: Bugzilla/Template.pm
===================================================================
RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Template.pm,v
retrieving revision 1.41
diff -r1.41 Template.pm
272a273,275
>
>     my $utf8_string = $string;
>     utf8::decode ($utf8_string);
274c277
<     return $string if !$length || length($string) <= $length;
---
>     return $string if !$length || length($utf8_string) <= $length;
277c280,283
<         my $newstr = substr($string, 0, $strlen) . $ellipsis;
---
>         my $newstr = substr($utf8_string, 0, $strlen) . $ellipsis;
>
>         utf8::encode ($newstr);

3) Comments are wrapped not as Unicode text but as bytes. As a result the length of each line of Russian text is about 40 characters, not 80. Patch for this:

Index: Bugzilla/Util.pm
===================================================================
RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v
retrieving revision 1.45
diff -r1.45 Util.pm
30a31
> use utf8;
230a232,233
>     utf8::decode($comment);
>
247a251
>     utf8::encode($wrappedcomment);

5) Searching with Russian text is impossible, only ASCII.
With a PostgreSQL db created with "-E UNICODE":

1) Perl strings are not marked as Unicode strings; Perl does not use UTF-8. Patch:

Index: Bugzilla.pm
===================================================================
RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla.pm,v
retrieving revision 1.29
diff -r1.29 Bugzilla.pm
26a27
> use encoding 'utf8';

Index: Bugzilla/DB/Pg.pm
===================================================================
RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Pg.pm,v
retrieving revision 1.18
diff -r1.18 Pg.pm
69,70c69,78
<
<     my $self = $class->db_new($dsn, $user, $pass);
---
>     my $attributes = { RaiseError => 0,
>                        AutoCommit => 1,
>                        PrintError => 0,
>                        ShowErrorStatement => 1,
>                        HandleError => \&_handle_error,
>                        TaintIn => 1,
>                        FetchHashKeyName => 'NAME',
>                        pg_enable_utf8 => 1};
>
>     my $self = $class->db_new($dsn, $user, $pass, $attributes);

After this patch an imported DB is displayed correctly and searches in Russian work, but the Bugzilla installation is not usable - new comments, bugs and other strings are saved in the db as bad Unicode strings.

My point of view is: if UTF-8 is declared as the encoding for new installations, then all Bugzilla Perl scripts MUST work with strings as Unicode strings, not as bytes. All the issues above are about this.
Attached file test, please ignore (obsolete) (deleted) —
please ignore, just test attach zip
Attachment #283019 - Attachment is obsolete: true
The content of attachment 283019 [details] has been deleted by Dave Miller <justdave@bugzilla.org> who provided the following reason: irrelevant to this bug The token used to delete this attachment was generated at 2007-10-01 09:55:04 PDT.
QA Contact: matty_is_a_geek → default-qa