Closed
Bug 389056
Opened 17 years ago
Closed 17 years ago
Don't break line between periods and quote
Categories
(Core :: Layout: Text and Fonts, defect)
Core
Layout: Text and Fonts
Tracking
()
RESOLVED
FIXED
People
(Reporter: jruderman, Assigned: masayuki)
References
()
Details
(Keywords: regression, testcase)
Attachments
(4 files, 21 obsolete files)
(deleted),
text/html
|
Details | |
(deleted),
patch
|
roc
:
review+
roc
:
superreview+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
roc
:
review+
roc
:
superreview+
roc
:
approval1.9+
|
Details | Diff | Splinter Review |
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.9a7pre) Gecko/2007072004 Minefield/3.0a7pre
<div style="width: 1px">"foo..."</div>
Renders as
"foo...
"
This looks silly. Apparently it only happens if there are two or more periods.
Reporter | ||
Comment 1•17 years ago
|
||
This regressed some time in July. Guessing it's due to the change in bug 255990.
confirmed on win XP.
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a7pre) Gecko/2007072005 Minefield/3.0a7pre
(btw <div style="width:1px"> foo=bar </div> renders as
foo=
bar
)
(In reply to comment #2)
> foo=
> bar
This is intended, of course.
Sorry for the spam.
j.j.
Assignee | ||
Comment 4•17 years ago
|
||
This can be fixed easy, I think.
# The equal issue is designed. We should break in the point for the URL params.
# And please file new bugs for other issues!
I think that always we should not break around |"|. We defined it as "starting parenthesis" (e.g., like as |(|). But it is often used as "finished parenthesis". It should be character class, I think. (|'| is character class.)
I'll attach a patch.
Assignee: nobody → masayuki
OS: Mac OS X → All
Hardware: PC → All
Version: unspecified → Trunk
Assignee | ||
Comment 5•17 years ago
|
||
(In reply to comment #4)
> I think that always we should not break around |"|. We defined it as "starting
> parenthesis" (e.g., like as |(|). But it is often used as "finished
> parenthesis". It should be character class, I think. (|'| is character class.)
Oops, this comment is wrong...
U+0022(") is character
U+0027(') is character
Assignee | ||
Updated•17 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 6•17 years ago
|
||
I think that this patch can fix this and similar issues.
1. If current and next character is U_PERIOD, current character class should be 8 (character).
2. ContextualAnalysis should be able to know the previous character class. Because the previous character may be "contextual analysis"ed.
3. "," should be character class as default. e.g., we have same problem in |i.e.,"|
Attachment #273311 -
Flags: superreview?(roc)
Attachment #273311 -
Flags: review?(roc)
Assignee | ||
Comment 7•17 years ago
|
||
(In reply to comment #6)
> 1. If current and next character is U_PERIOD, current character class should be
> 8 (character).
Or, should we define the default class is 8? (U_PERIOD default class is 1, it means "end parenthesis".)
Perhaps we could fix this (and some other issues) in a simple way by requiring that there be at least one white-space or non-punctuation character between any two breaks (including treating the end of the string as a break)? This means that we would never break inside any string of punctuation at the end of a word.
However currently we don't distinguish punctuation from normal characters in the character classes. How hard would that be?
Assignee | ||
Comment 9•17 years ago
|
||
I think that we have two ways:
1. we should have punctuation tables.
2. we should add 'a' class for punctuations.
I think that 2 is better. It makes simple code.
Assignee | ||
Comment 10•17 years ago
|
||
The punctuations are several classes. There are connector, dash, open, close, initial quote, final quote and others, so, we may not be able to fix this by "2" easy. "1" may be better...
We don't need to distinguish all those, right? We just need one new class for all the punctuation that currently is classified as "character".
BTW can you add comments somewhere explaining what all the (compressed) classes actually mean?
Assignee | ||
Comment 12•17 years ago
|
||
This is a draft for new approach, but I have following issues:
1. The class name translating of JIS X 4051 is so hard for me. I need help of native speakers.
2. Need to change over U+0080 too.
3. Now, we are using class 7 (Alphabet) *only* for U+2126 (OHM SIGN), I think that it should be class 8 (Character), therefore, we can remove class 7. (And we can use class 7 for new class.)
Attachment #273311 -
Attachment is obsolete: true
Attachment #273311 -
Flags: superreview?(roc)
Attachment #273311 -
Flags: review?(roc)
Assignee | ||
Comment 13•17 years ago
|
||
er, the table of breakable (section 5) is wrong, we should break between [c] and (16 or 17 or 18).
1 [a] 7 8 9 [b]15 16 18 COMPLEX [c]
1 X X X X X X X X X X X
[a] X
7 X X
8 X X
9 X
[b] X
15 X X X X X
16 X X X X
18 X X X X X
COMPLEX X T
[c] X X X X X
should be:
1 [a] 7 8 9 [b]15 16 18 COMPLEX [c]
1 X X X X X X X X X X X
[a] X
7 X X
8 X X
9 X
[b] X
15 X X X X X
16 X X X X
18 X X X X X
COMPLEX X T
[c] X X
But I think that we need special handling for period (for file path) and single/double quote(They can be both open/close parenthesis).
Assignee | ||
Comment 14•17 years ago
|
||
er, I'm being confused...
please ignore comment 13.
the table should be:
1 [a] 7 8 9 [b]15 16 18 COMPLEX [c]
1 X X X X X X X X X X X
[a] X X
7 X X
8 X X
9 X
[b] X
15 X X X X X
16 X X X X
18 X X X X X
COMPLEX X T
[c] X X X X X
We don't need to break after [c] when next is character/numeric... And we should not break before [c] when prev is [a], right?
> 1. The class name translating of JIS X 4051 is so hard for me. I need help of
> native speakers.
What you have there is pretty good except I don't know what some of them are. In particular can you explain a bit more:
+ 5: Middle dot
+ 8: Prefix (ellipsis) symbol
+ 9: Suffix (ellipsis) symbol
+ 19: Split line note (Warichu) begin quote
+ 20: Split line note (Warichu) end quote
?
Yes, make OHM a character.
#define NUMERIC_CLASS 6 // JIS x4051 class 15 is now map to simplified class 6
#define CHARACTER_CLASS 8 // JIS x4051 class 18 is now map to simplified class 8
+#define PUNCTUATION_CLASS 0xA // see bug 389056
Also create #defines for all our compressed classes. I'll help you come up with names.
Assignee | ||
Comment 16•17 years ago
|
||
(In reply to comment #15)
> > 1. The class name translating of JIS X 4051 is so hard for me. I need help of
> > native speakers.
>
> What you have there is pretty good except I don't know what some of them are.
> In particular can you explain a bit more:
> + 5: Middle dot
e.g., ·
> + 8: Prefix (ellipsis) symbol
e.g., $, ¥ and "No."
> + 9: Suffix (ellipsis) symbol
e.g., cent sign, % and degree sign
> + 19: Split line note (Warichu) begin quote
> + 20: Split line note (Warichu) end quote
http://www.w3.org/TR/2003/CR-css3-text-20030514/#text-combine-prop
Warichu is a ruby layout pattern. It will be styled as "text-combine: lines;".
19 is open parenthesis of Warichu, 20 is close it. We don't support these class.
Assignee | ||
Comment 17•17 years ago
|
||
I'm rearranging the definition table (jisx4501class.txt). bug 388096 will be fixed by the next patch.
Blocks: 388096
Assignee | ||
Comment 18•17 years ago
|
||
I misunderstood the class 7 of JIS X 4051. It is not a class for hyphens. It means that the characters are breakable before and after, but they are not breakable between same class characters.
Attachment #273615 -
Attachment is obsolete: true
Assignee | ||
Comment 19•17 years ago
|
||
And this patch changes the class of '<' and '>' to new punctuation class.
Assignee | ||
Comment 20•17 years ago
|
||
Don't break around '(' and ')' and ';'.
Cleaning up the code with Mozilla standard coding style.
Attachment #273852 -
Attachment is obsolete: true
Assignee | ||
Comment 21•17 years ago
|
||
umm... by comment 19, the rendering of https://bugzilla.mozilla.org/attachment.cgi?id=273944&action=diff#mozilla/intl/lwbrk/tools/spec_table.html_sec1 is broken. Should we break only between '>' and '<' for SGML/HTML/XML?
Assignee | ||
Comment 22•17 years ago
|
||
er, the current approach for '(' and ')' are wrong.
data:text/html,<div%20style="width:1px;">a[([b])]c</div>
This case, cannot clear. I'll try to fix it...
Assignee | ||
Comment 23•17 years ago
|
||
O.K. This patch works fine for me.
I create new two classes, [c] and [d]. [c] is character like open parenthesis. [d] is character like close parenthesis (or punctuation). By the two class, '><' problem is cleared. And also '[([' problem too. And contextual analysis is changed to very simple.
Attachment #273944 -
Attachment is obsolete: true
Attachment #274126 -
Flags: superreview?(roc)
Attachment #274126 -
Flags: review?(roc)
Assignee | ||
Comment 24•17 years ago
|
||
sorry, for the spam.
this is better names.
Attachment #274126 -
Attachment is obsolete: true
Attachment #274128 -
Flags: superreview?(roc)
Attachment #274128 -
Flags: review?(roc)
Attachment #274126 -
Flags: superreview?(roc)
Attachment #274126 -
Flags: review?(roc)
It looks OK, but I don't see how it fixes this bug. The text
..."
is just a run of punctuation of class [d], and we can break between those characters. Right?
http://groups.google.com/group/mozilla.dev.tech.layout/browse_thread/thread/47c49202f2d13f94
Er, what I was going to say is, I think we should discuss this on this thread:
http://groups.google.com/group/mozilla.dev.tech.layout/browse_thread/thread/47c49202f2d13f94
before we go ahead. Also I'd like to find out what the jp-critical bugs are that require breaking of Latin-1 text around punctuation.
Updated•17 years ago
|
Flags: blocking1.9? → blocking1.9+
Updated•17 years ago
|
Component: Layout → Layout: Fonts and Text
Assignee | ||
Comment 27•17 years ago
|
||
This changes many character's behaviors by the feedbacks.
1. Added the new class which is not breakable around them even if there are breakable. But if there is SPACE, it is breakable. This behavior is defined in UAX#14.
2. [] is not breakable now if around characters are character/numeric class.
3. / and \ are not breakable now. But if a word has two or more /s or \s, they are breakable except the first one. And they are breakable *before* them at the time. Therefore, the path text always has / or \. I believe that this is best way. But / is not breakable everytime if next is numeric for date format. (2007/01/01)
4. DEGREE SIGN is not breakable if after character is character class. But it's still breakable if after character is numeric, for compatibility.
5. The hyphen is not breakable if the text doesn't have character class, or next is numeric class. Therefore, we cannot break 2007-Aug-07
6. Pound/Yen sign are not same behavior with IE7.
Attachment #274128 -
Attachment is obsolete: true
Attachment #275307 -
Flags: superreview?(roc)
Attachment #275307 -
Flags: review?(roc)
Attachment #274128 -
Flags: superreview?(roc)
Attachment #274128 -
Flags: review?(roc)
Assignee | ||
Comment 28•17 years ago
|
||
Oops, sorry for the mistake and the spam.
Attachment #275307 -
Attachment is obsolete: true
Attachment #275308 -
Flags: superreview?(roc)
Attachment #275308 -
Flags: review?(roc)
Attachment #275307 -
Flags: superreview?(roc)
Attachment #275307 -
Flags: review?(roc)
Assignee | ||
Comment 29•17 years ago
|
||
Hmm... I'm working for this testcase... I hate smiley :-(
Attachment #275308 -
Attachment is obsolete: true
Attachment #275308 -
Flags: superreview?(roc)
Attachment #275308 -
Flags: review?(roc)
Assignee | ||
Comment 30•17 years ago
|
||
Attachment #275854 -
Attachment is obsolete: true
Assignee | ||
Comment 31•17 years ago
|
||
I suppress the breaking in most ASCII characters if the around characters are characters or numerics.
This patch is changed from previous version:
1. Add new class that is for punctuation like character, it is same as character class, but it is needed for hypen processing.
2. The date format is checked the word length, therefore, the date format is not broken, but the chemical context is breakable.
3. If the last character of a word is CLASS_CLOSE or CLASS_CLOSE_LIKE_CHARACTER, we never break before it.
4. '%' is not breakable. But if the 3rd before or after character is '%', It may be '%' encoded URL param, therefore, it is broken before only then.
5. '&' and ';' is not breakable, But if there are one or more '='s before it, they are broken before them. It may be a param of URL.
6. '-' is not breakable, But if the both sides are numeric or character class, it is broken after.
Note that they are checked the context only in a word, therefore, the all testcases are cleared by this patch.
Attachment #275972 -
Attachment is obsolete: true
Attachment #275975 -
Flags: superreview?(roc)
Attachment #275975 -
Flags: review?(roc)
Assignee | ||
Updated•17 years ago
|
Attachment #275972 -
Attachment is obsolete: false
Are these testcases added to the reftests?
Assignee | ||
Comment 33•17 years ago
|
||
It's difficult issue, because the line-breaking spec may be fluid now...
# If some languages users have serious issue, we may give up some non-important cases.
But for non-intended regressions, it's a good thing...
Sure, you can change the tests as you change the desired behaviour; the tests are there to tell you when you make changes that _aren't_ desired. :)
Assignee | ||
Comment 35•17 years ago
|
||
roc:
Should we land it for getting new feedbacks?
Assignee | ||
Comment 36•17 years ago
|
||
work in progress for new "safety" approach.
Attachment #275975 -
Attachment is obsolete: true
Attachment #275975 -
Flags: superreview?(roc)
Attachment #275975 -
Flags: review?(roc)
Assignee | ||
Comment 37•17 years ago
|
||
This should be 'safety' for western language.
1. Using CLASS_OPEN_LIKE_CHARACTER and CLASS_CLOSE_LIKE_CHARACTER for western punctuations.
2. If a breakable point is 'near' from start of word, end of word and previous breakable point in western language context, we use GetPairForWestern instead of GetPair. It is not breaking between numeric, western characters, CLASS_OPEN_LIKE_CHARACTER and CLASS_CLOSE_LIKE_CHARACTER.
# now, I defined that the 'near' is 6 characters. By this, the date format is not broken. (E.g., 01/01/2007 and 2007-01-01)
3. If the word has CJK characters, we always use GetPair. So, CJK context is always broken by JIS X 4051 rules.
Attachment #279309 -
Attachment is obsolete: true
Attachment #279339 -
Flags: superreview?(roc)
Attachment #279339 -
Flags: review?(roc)
Assignee | ||
Comment 38•17 years ago
|
||
Attachment #275972 -
Attachment is obsolete: true
Assignee | ||
Comment 39•17 years ago
|
||
Comment on attachment 279339 [details] [diff] [review]
Patch rv5.1
er, this has a mistake, please wait next.
Attachment #279339 -
Attachment is obsolete: true
Attachment #279339 -
Flags: superreview?(roc)
Attachment #279339 -
Flags: review?(roc)
Assignee | ||
Comment 40•17 years ago
|
||
Attachment #279352 -
Flags: superreview?(roc)
Attachment #279352 -
Flags: review?(roc)
Assignee | ||
Comment 41•17 years ago
|
||
Attachment #279340 -
Attachment is obsolete: true
+IS_WORDSEPARATOR(PRUnichar u)
+{
+ return (0x0009 <= u && u <= 0x000D) || u == 0x0020 || u == 0x0085 ||
+ u == 0x00A0 || u == 0x1680 || u == 0x180E ||
+ (0x2000 <= u && u <= 0x200B) || u == 0x2028 || u == 0x2029 ||
+ u == 0x202F || u == 0x205F || u == 0x3000;
+}
Could you add comments or #defines explaining what all these are? Also for the first range, if we're going to include 0x000B and 0x000C, why not include all characters <= 0x0020?
+GetPairForWestern(PRInt8 c1, PRInt8 c2)
I think this name is misleading. It's not "western" in particular because the other GetPair is also used to break "Western" words if they're long enough. This function is more like GetPairConservative; we use it where we want to err on the side of *not* breaking. This would mean changes to some of the comments.
+ void SetIndex(PRUint32 aIndex) {
+ NS_ASSERTION(mIndex < aIndex || mIndex == aIndex && mIndex == 0,
+ "not supported to index decreasing");
Call this AdvanceIndexTo. The comment can say "the index cannot decrease." You could just test mIndex <= aIndex.
+ PRBool UseJISX4051() {
We're not using JISx4051, really, so we shouldn't add new references to it, right?
+ PRUnichar mPrevOfHyphen;
Call it "mCharBeforeHyphen" and add a comment explaining exactly what it is.
+ PRPackedBool mHasCJKChar;
Call it mWordHasCJKChar and document exactly what it means.
+ PRUint32 mCountOfSlash;
+ PRUint32 mCountOfBackSlash;
+ PRUint32 mCountOfEqual;
Instead of storing these, why not just calculate them when they're needed, since they're hardly ever needed?
+ if (prevIsNum && nextIsNum)
+ return CLASS_NUMERIC;
+ // If both sides are numeric or character, this hyphen should be breakable.
Hmm ... it looks like if both sides are numeric, we don't allow breaking?
+ // This text has two or more (BACK)SLASHs, this may be file path or URL.
Should be "If this text..."
+ // This text has two or more (BACK)SLASHs, this may be file path or URL.
+ if (aState.UseJISX4051()) {
+ PRInt32 count =
+ (cur == U_SLASH) ? aState.CountOfSlash() : aState.CountOfBackSlash();
+ if (count > 1)
+ return CLASS_OPEN;
+ }
Why do we need this? We would still break after a slash without this code, right?
I also don't see why we need special code for % characters.
+ // If this may be a separator of params of URL, we should break after.
+ if (aState.UseJISX4051AtNext() && aState.CountOfEqual() > 0)
+ return CLASS_CLOSE;
Why don't we make these CLASS_CLOSE always?
Assignee | ||
Comment 43•17 years ago
|
||
(In reply to comment #42)
> + PRBool UseJISX4051() {
>
> We're not using JISx4051, really, so we shouldn't add new references to it,
> right?
Do you have the idea for the better name?
> + if (prevIsNum && nextIsNum)
> + return CLASS_NUMERIC;
> + // If both sides are numeric or character, this hyphen should be
> breakable.
>
> Hmm ... it looks like if both sides are numeric, we don't allow breaking?
Right. Do we need to break?
> + // This text has two or more (BACK)SLASHs, this may be file path or URL.
> + if (aState.UseJISX4051()) {
> + PRInt32 count =
> + (cur == U_SLASH) ? aState.CountOfSlash() : aState.CountOfBackSlash();
> + if (count > 1)
> + return CLASS_OPEN;
> + }
>
> Why do we need this? We would still break after a slash without this code,
> right?
No, U_SLASH and U_BACKSLASH is CLASS_CHARACTER now. Therefore, they are not breakable. We need to change the CLASS when we need to break.
> I also don't see why we need special code for % characters.
>
> + // If this may be a separator of params of URL, we should break after.
> + if (aState.UseJISX4051AtNext() && aState.CountOfEqual() > 0)
> + return CLASS_CLOSE;
>
> Why don't we make these CLASS_CLOSE always?
I think U_AMPERSAND and U_SEMICOLON should not be breakable in normal sentence, right? E.g., "A&B" in western language or " " in the testcase.
(In reply to comment #43)
> (In reply to comment #42)
> > + PRBool UseJISX4051() {
> >
> > We're not using JISx4051, really, so we shouldn't add new references to it,
> > right?
>
> Do you have the idea for the better name?
How about UseConservativeBreaking() (which is the opposite of what UseJISX4051 returns now)?
> > + if (prevIsNum && nextIsNum)
> > + return CLASS_NUMERIC;
> > + // If both sides are numeric or character, this hyphen should be
> > breakable.
> >
> > Hmm ... it looks like if both sides are numeric, we don't allow breaking?
>
> Right. Do we need to break?
Ok, the comment needs to be more clear. I guess it should say "if one side is numeric and the other is a character, the hyphen should be breakable".
> > + // This text has two or more (BACK)SLASHs, this may be file path or URL.
> > + if (aState.UseJISX4051()) {
> > + PRInt32 count =
> > + (cur == U_SLASH) ? aState.CountOfSlash() : aState.CountOfBackSlash();
> > + if (count > 1)
> > + return CLASS_OPEN;
> > + }
> >
> > Why do we need this? We would still break after a slash without this code,
> > right?
>
> No, U_SLASH and U_BACKSLASH is CLASS_CHARACTER now. Therefore, they are not
> breakable. We need to change the CLASS when we need to break.
OK
> > I also don't see why we need special code for % characters.
> >
> > + // If this may be a separator of params of URL, we should break after.
> > + if (aState.UseJISX4051AtNext() && aState.CountOfEqual() > 0)
> > + return CLASS_CLOSE;
> >
> > Why don't we make these CLASS_CLOSE always?
>
> I think U_AMPERSAND and U_SEMICOLON should not be breakable in normal sentence,
> right? E.g., "A&B" in western language or " " in the testcase.
OK, although both of those would not be broken thanks to DONT_BREAK_RANGE.
BTW "DONT_BREAK_RANGE" would be better named "CONSERVATIVE_BREAK_RANGE" since we can actually break in that region...
I don't like this special-case (approximate) detection of URLs and file paths. But I don't have a better idea right now.
Assignee | ||
Comment 45•17 years ago
|
||
> +IS_WORDSEPARATOR(PRUnichar u)
> +{
> + return (0x0009 <= u && u <= 0x000D) || u == 0x0020 || u == 0x0085 ||
> + u == 0x00A0 || u == 0x1680 || u == 0x180E ||
> + (0x2000 <= u && u <= 0x200B) || u == 0x2028 || u == 0x2029 ||
> + u == 0x202F || u == 0x205F || u == 0x3000;
> +}
>
> Could you add comments or #defines explaining what all these are? Also for the
> first range, if we're going to include 0x000B and 0x000C, why not include all
> characters <= 0x0020?
They are listed in Unicode spec as white spaces.
http://www.unicode.org/Public/UNIDATA/PropList.txt
Attachment #279352 -
Attachment is obsolete: true
Attachment #279495 -
Flags: superreview?(roc)
Attachment #279495 -
Flags: review?(roc)
Attachment #279352 -
Flags: superreview?(roc)
Attachment #279352 -
Flags: review?(roc)
> HasCharacterInWordAlrady
HasCharacterInWordAlready
+ void Reset() {
Call this "NextWord" instead ... "Reset" sounds like it will reset everything to the initial state.
IS_WORDSEPARATOR is really complicated. I don't think we need to care about all those spaces. How about we just limit it to 0x0A, 0x0D, 0x20, 0xA0, 0x200B, and 0x3000? Then we could change nsLineBreaker::IsSpace to match, and then simplify your nsLineBreakder code by removing the word-separator related code from ContextState, since we know IS_WORDSEPARATOR characters will not occur in the parameter to GetJISx4051Breaks. Likewise you would change WordMove so that it stops at IS_WORDSEPARATOR instead of IS_SPACE. What do you think of that?
I still don't understand why we need to break before %.
I also still don't like the URL-detection stuff that's going on here, although I still haven't thought of a better way.
+ PRBool UseConservativeBreakingAtNext() {
Instead of writing the same function twice, how about taking a PRUint32 parameter that represents the offset from the current position that you want to test (which will be 1 or 0)?
Is semicolon really a URL parameter separator?
Assignee | ||
Comment 48•17 years ago
|
||
Sorry, for the delay.
(In reply to comment #46)
> I still don't understand why we need to break before %.
Most Japanese character is 3 bytes in UTF-8. Therefore, the encoded Japanese text is:
(the count of Japanese character) * 3 bytes * 3 (%xx)
So, the encoded string is usually very long word.
(In reply to comment #47)
> Is semicolon really a URL parameter separator?
HTML4 spec said:
http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.2
> We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
But I don't know whether such implementations are on the Earth :-(
Assignee | ||
Comment 49•17 years ago
|
||
(In reply to comment #46)
> IS_WORDSEPARATOR is really complicated. I don't think we need to care about all
> those spaces. How about we just limit it to 0x0A, 0x0D, 0x20, 0xA0, 0x200B, and
> 0x3000? Then we could change nsLineBreaker::IsSpace to match, and then simplify
> your nsLineBreakder code by removing the word-separator related code from
> ContextState, since we know IS_WORDSEPARATOR characters will not occur in the
> parameter to GetJISx4051Breaks. Likewise you would change WordMove so that it
> stops at IS_WORDSEPARATOR instead of IS_SPACE. What do you think of that?
Sounds good. But if I change nsLineBreaker::IsSpace, it seems that I need to change other codes. Because the current code assume that arround |nsLineBreaker::IsSpace| can break. But 0x0A is not so...
And we should keep to check |0x2000 <= u && u <= 0x200B|. Because they can be in Western text. We should keep the spec in Western context, I think.
OK, drop 0x0A from IS_WORDSEPARATOR/IsSpace. 0x0D can stay; I think it's OK to allow a soft break after that. The text-frame code won't ever use a soft break at the very start of a line.
> And we should keep to check |0x2000 <= u && u <= 0x200B|
OK
Let's drop breaking after ';', I've never seen that as a URL parameter separator.
Breaking before a % is OK.
Assignee | ||
Comment 51•17 years ago
|
||
(In reply to comment #50)
> OK, drop 0x0A from IS_WORDSEPARATOR/IsSpace. 0x0D can stay; I think it's OK to
> allow a soft break after that. The text-frame code won't ever use a soft break
> at the very start of a line.
hmm... If I drop 0x0A from them, we cannot clear the test of 48, 49 and 50 in the latest testcase, is it O.K.? It seems that they are realistic cases...
Assignee | ||
Comment 52•17 years ago
|
||
> 0x0A
er, 0xA0.
Assignee | ||
Comment 53•17 years ago
|
||
how about this?
In UseConservativeBreaking, the no-breakable spaces are searched directly. (Only when the text has one or more non-breakable spaces.)
# and 0x2007 is not a space, it's no-breakable, sorry.
Attachment #279495 -
Attachment is obsolete: true
Attachment #280777 -
Flags: superreview?(roc)
Attachment #280777 -
Flags: review?(roc)
Attachment #279495 -
Flags: superreview?(roc)
Attachment #279495 -
Flags: review?(roc)
> test of 48, 49 and 50 in the latest testcase
What testcase?
+ u == 0x000A || // LINE FEED
Don't treat this as nsLineBreaker::IsSpace or IS_SPACE
+static inline PRBool
IS_SPACE(PRUnichar u)
Can we move this to a static inline function in nsILineBreaker? Then it could be shared with nsLineBreaker.h so we don't have to duplicate code and keep it consistent.
There is a problem here which is that spaces which combine with an adjacent combining mark are not really breakable. I guess I will try to solve this at a higher level by changing the text that gets input to the linebreaker and textrun.
+ for (PRUint32 i = index; index - CONSERVATIVE_BREAK_RANGE < i;) {
+ if (IS_NO_BREAKABLE_SPACE(GetCharAt(--i)))
Rewrite this to not use predecrement, please. I find it confusing. Also here:
+ if (GetCharAt(--i) == aCh)
+ PRUnichar ch = GetCharAt(--i);
mHasNoBreakableSpace should be "mHasNonbreakableSpace" and IS_NO_BREAKABLE_SPACE should be IS_NONBREAKABLE_SPACE. ("mHasNoBreakableSpace" would mean that the word has no breakable spaces in it, which is always true. Also, "non" is not an English word on its own so we shouldn't treat it as one.)
+ // If this text has no-breakable space, we need to check whether the index
+ // is near it.
Delete "If" --- at this point, we know the text has nonbreakable space.
This is looking good.
Comment on attachment 280777 [details] [diff] [review]
Patch rv5.4
clearing request pending comments
Attachment #280777 -
Flags: superreview?(roc)
Attachment #280777 -
Flags: review?(roc)
Assignee | ||
Comment 56•17 years ago
|
||
>> test of 48, 49 and 50 in the latest testcase
>
> What testcase?
attachment 279353 [details].
Attachment #280777 -
Attachment is obsolete: true
Attachment #281318 -
Flags: superreview?(roc)
Attachment #281318 -
Flags: review?(roc)
Assignee | ||
Comment 57•17 years ago
|
||
er, sorry, I killed wrong line...
Attachment #281318 -
Attachment is obsolete: true
Attachment #281322 -
Flags: superreview?(roc)
Attachment #281322 -
Flags: review?(roc)
Attachment #281318 -
Flags: superreview?(roc)
Attachment #281318 -
Flags: review?(roc)
Comment on attachment 281322 [details] [diff] [review]
Patch rv5.5.1
OK I like this. Let's land it.
Can you turn that testcase into a reftest?
Attachment #281322 -
Flags: superreview?(roc)
Attachment #281322 -
Flags: superreview+
Attachment #281322 -
Flags: review?(roc)
Attachment #281322 -
Flags: review+
Assignee | ||
Comment 59•17 years ago
|
||
checked-in the patch.
(In reply to comment #58)
> Can you turn that testcase into a reftest?
O.K. I'll try to create it tonight.
Assignee | ||
Comment 60•17 years ago
|
||
ah, my working files were lost by a trouble of my system... :-(
please wait the reftests. sorry.
Assignee | ||
Comment 61•17 years ago
|
||
Roc:
(In reply to comment #48)
> (In reply to comment #46)
> (In reply to comment #47)
> > Is semicolon really a URL parameter separator?
>
> HTML4 spec said:
>
> http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.2
>
> > We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
>
> But I don't know whether such implementations are on the Earth :-(
I found such implementation:
http://git.kernel.org/
Assignee | ||
Comment 62•17 years ago
|
||
I found 3 bugs at creating reftests.
1. '?' is defined as CLASS_CLOSE, it should be CLASS_CLOSE_LIKE_CHARACTER for prohibiting the breaking between alphabets.
2. ContextState::GetPreviousNonHyphenCharacter returns PR_FALSE if mIndex == 0, it should be U_NULL.
3. The class of HYPHEN and FIGURE DASH is deferent from HYPHEN/MINUS in ASCII. They should be same class for same behavior.
And for comment 61, we should break after SEMICOLON if it is a part of param of URL. We should respect the HTML spec. And PHP can use ';' as '&'. See http://www.php.net/manual/en/ini.core.php#ini.arg-separator.input
Attachment #281998 -
Flags: superreview?(roc)
Attachment #281998 -
Flags: review?(roc)
Attachment #281998 -
Flags: approval1.9?
Assignee | ||
Comment 63•17 years ago
|
||
er, And EN DASH was not added to IS_HYPHEN, it should be hyphen in UAX#14.
Assignee | ||
Comment 64•17 years ago
|
||
And the leaders (U+2024 ONE DOT LEADER, U+2025 TWO DOT LEADER and U+2026 HORIZONTAL ELLIPSIS) should be CLASS_CLOSE_LIKE_CHARACTER. (They should be same class as FULL STOP of ASCII.)
Attachment #281998 -
Attachment is obsolete: true
Attachment #281999 -
Flags: superreview?(roc)
Attachment #281999 -
Flags: review?(roc)
Attachment #281999 -
Flags: approval1.9?
Attachment #281998 -
Flags: superreview?(roc)
Attachment #281998 -
Flags: review?(roc)
Attachment #281998 -
Flags: approval1.9?
Assignee | ||
Comment 65•17 years ago
|
||
hmmm.... The for loops were changed the meaning after roc's review...
I hate the unsigned index loop :-(
Attachment #281999 -
Attachment is obsolete: true
Attachment #282034 -
Flags: superreview?(roc)
Attachment #282034 -
Flags: review?(roc)
Attachment #282034 -
Flags: approval1.9?
Attachment #281999 -
Flags: superreview?(roc)
Attachment #281999 -
Flags: review?(roc)
Attachment #281999 -
Flags: approval1.9?
Comment on attachment 282034 [details] [diff] [review]
fix some nits. v3.
you could write those for loops as
for (PRUint32 i = mIndex; i > 0; --i) {
if (GetCharAt(i - 1) == aCh)
return PR_TRUE;
}
Attachment #282034 -
Flags: superreview?(roc)
Attachment #282034 -
Flags: superreview+
Attachment #282034 -
Flags: review?(roc)
Attachment #282034 -
Flags: review+
Attachment #282034 -
Flags: approval1.9?
Attachment #282034 -
Flags: approval1.9+
Assignee | ||
Comment 68•17 years ago
|
||
sorry, for the delay.
This is a first post for the testing. If I find more tests, I'll post is in new bug.
Attachment #283979 -
Flags: superreview?(roc)
Attachment #283979 -
Flags: review?(roc)
Assignee | ||
Comment 69•17 years ago
|
||
I updated the wiki to latest line-breaking spec. And I also removed the fixed issues documents.
http://wiki.mozilla.org/Gecko:Line_Breaking
Comment 70•17 years ago
|
||
What's the deeper sense of the "don't break near " rule?
Isn't it reasonable to break after the hyphen here:
xxx xxx-yyyyyy ?
Assignee | ||
Comment 71•17 years ago
|
||
(In reply to comment #70)
> Isn't it reasonable to break after the hyphen here:
> xxx xxx-yyyyyy ?
no, 'xxx-' is too short. non-breakable spaces might be word separator.
Comment 72•17 years ago
|
||
But xxx xxx isn't too short, and xxx xxx-yyyyyy is long enough to destroy layout.
Something like this should definitly break after "-":
xxxxxxxxxxxxxxxxx xxxx-yyyyyyyyyyyyyyyy
Furthermore, I dislike "don't break between hyphens".
Very long chains of "-" should break IMHO (min. 6 characters?).
And something like this should break also:
-------------------------------------------------------------xxxxxxxxxxxxx---------------------------------------------------------------------------------------------
Assignee | ||
Comment 73•17 years ago
|
||
(In reply to comment #72)
> But xxx xxx isn't too short, and xxx xxx-yyyyyy is long enough to
> destroy layout.
I have a question, do you think what cases? It seems that we cannot care your suggestion without hyphen. Isn't it a very special case? Note that we cannot support all cases on Earth.
> Furthermore, I dislike "don't break between hyphens".
> Very long chains of "-" should break IMHO (min. 6 characters?).
I think that this case really needs prioritized line-breaking. So, we should not care this case on Gecko 1.9.
Note that I know the current spec is not enough for many cases. Before this bug's patch was checked-in, very many ASCII characters were breakable. But it broke many text layout of Western language. See the bugs which was blocked by this bug.
Current approach is designed with following:
1. Don't break western text layout compatibility with Fx2.
2. Should break in URI for Japanese marketing.
3. Should break after hyphen for some Western language.
4. Don't use a risky way.
I don't think we should break long chains of consecutive hyphens. Someone's probably using them for graphic effect. It's best to err on the side of caution.
Comment on attachment 283979 [details] [diff] [review]
reftests for line-breaking
+<!-- BACK SLASH is used for YEN SIGN and WON SING in Japan and Korea -->
WON SIGN?
These look good! Thanks!
Attachment #283979 -
Flags: superreview?(roc)
Attachment #283979 -
Flags: superreview+
Attachment #283979 -
Flags: review?(roc)
Attachment #283979 -
Flags: review+
Attachment #283979 -
Flags: approval1.9+
Comment 76•17 years ago
|
||
can this be marked FIXED?
Assignee | ||
Comment 77•17 years ago
|
||
er, exactly. Sorry.
-> FIXED
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Updated•17 years ago
|
Flags: in-testsuite+
Comment 78•17 years ago
|
||
(In reply to comment #74)
> I don't think we should break long chains of consecutive hyphens. Someone's
> probably using them for graphic effect.
That's unlikely since IE, Safari3 and Opera9.2 break between any "--". (Opera 9.5, latest weekly, doesn't)
> It's best to err on the side of caution.
Isn't it best to reduce browser incompatibilities?
You need to log in
before you can comment on or make changes to this bug.
Description
•