Closed
Bug 200582
Opened 22 years ago
Closed 22 years ago
No line wrap for text without spaces in nsHTMLContentSerializer.cpp
Categories
(Core :: DOM: Editor, defect, P2)
Core
DOM: Editor
Tracking
()
RESOLVED
FIXED
mozilla1.4beta
People
(Reporter: nhottanscp, Assigned: KaiE)
References
Details
(Keywords: embed, topembed+, Whiteboard: edt_x3)
Attachments
(2 files, 8 obsolete files)
(deleted),
text/html
|
Details | |
(deleted),
patch
|
KaiE
:
review+
KaiE
:
superreview+
asa
:
approval1.4+
|
Details | Diff | Splinter Review |
In Mozilla mail (HTML), sending out long lines which do not include spaces (e.g.
Japanese text) do not wrap.
E.g. the next line has 100 characters and it does not wrap.
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
Reporter | ||
Comment 1•22 years ago
|
||
http://lxr.mozilla.org/seamonkey/source/content/base/src/nsHTMLContentSerializer.cpp#740
The code is trying to find a white space for a line break.
Note that mMaxColumn is not checked here.
740 // if we're past the wrapping width with no place to wrap at,
741 // find the next whitespace and wrap there
742 while (pos != end && *pos != ' ' && *pos != '\n') {
743 ++pos;
744 }
745
mAddSpace is false so going to else
746 if (mAddSpace) {
747 // whitespace was needed before the next segment, so we can put
748 // a newline instead of a space, and avoid getting a lone line
749 aOutputStr.Append(mLineBreak);
750 addLineBreak = PR_FALSE;
751
752 mColPos = pos - segStart;
753
754 // if the string doesn't end in whitespace, set mAddSpace to false
755 if (pos == end) {
756 mAddSpace = PR_FALSE;
757 }
758 }
Here is set addLineBreak to true. But at that point the position reaches at end,
so no linebreak is added before returning at line 766.
But the loop at line 742 does not do a column check, the line could be very long
even if we change to put one linebreak before return.
759 else {
760 // no choice but to write a long line and wrap immediately after it
761 addLineBreak = PR_TRUE;
762 }
763 aOutputStr.Append(segStart, pos - segStart);
764
765 if (pos == end) {
766 return;
767 }
Reporter | ||
Comment 2•22 years ago
|
||
At the loop at line 742, nsILineBreaker may be used to get a linebreak for text
without spaces like Japanese text, that is already used in
nsPlainTextSerializer. It can also check like (mMaxColumn * 2) to prevent the
line to be very long.
Keywords: embed
Comment 3•22 years ago
|
||
See also bug 56921, which covers a bug in this code which makes ascii lines too
long.
Reporter | ||
Comment 4•22 years ago
|
||
The patch is to prevent very long lines which may cause data corruption for
encoded Japanese text (e.g. ISO-2022-JP) when the server forces line breaks.
I think the integration of the line breaker for better formatting can be done
separately by someone more familiar with the code.
Reporter | ||
Updated•22 years ago
|
Attachment #119697 -
Flags: review?(akkana)
Comment 5•22 years ago
|
||
I'm not sure I'm understanding this correctly. Why doesn't the while (mColPos <
mMaxColumn) take care of this -- why isn't the new check redundant with that
one? By the time we get to the place where the patch adds the new check, isn't
mColPos >= mMaxColumn already, without seeing a space? So the change made by
the patch is that if there wasn't a space in the line, the line will be forcibly
broken in the middle of a word, whether or not a space was seen? Am I
understanding that correctly, and if so, is that really a change we want to
make? Will it cause a line of 79 ascii dashes to be broken into long-short
pairs (with the default wrapcol of 72)? I don't have a tree that builds right
now, and we're having unexplained network slowness today so it may be a while
before I can update and test it.
I wish we could just do wrapping right, using nsILineBreaker like in the
plaintext serializer and searching backwards as described in bug 56921, instead
of applying more band-aids on top of the existing band-aid layers.
Burpmaster, do you have time to glance at the patch and see if I'm
misinterpreting? You're probably the most familiar with this code.
Reporter | ||
Comment 6•22 years ago
|
||
>the patch is that if there wasn't a space in the line, the line will be forcibly
>broken in the middle of a word, whether or not a space was seen? Am I
Yes, for Japanese text that is okay (and this is for HTML source not affect
actual display), for English it could be a problem but it only happens if the
word is very long like greater than 72, so I assume practically okay.
>I wish we could just do wrapping right, using nsILineBreaker like in the
>plaintext serializer and searching backwards as described in bug 56921, instead
>of applying more band-aids on top of the existing band-aid layers.
I agree if the right implementation is available soon then the current patch
would not be needed.
If the wrapping has to happen within the column then the search has to go
backwards. And the line breaker will take care Japanese text but a long word
without a space like I mentioned in the original report would not be broken even
with the line breaker.
Reporter | ||
Comment 7•22 years ago
|
||
nsbeta1
The data corruption I mentioned in comment #4 is generic and can happen with
Mozilla mail.
Keywords: nsbeta1
Comment 8•22 years ago
|
||
Comment on attachment 119697 [details] [diff] [review]
When looping for white space, also check for max line.
Sorry, but no, I can't approve breaking wrapping latin text just because nobody
wants to spend the time writing code that calls nsILineBreaker. Lots of people
send plaintext mail with long separator lines of "*" or "=" or whatever, as
well as ascii art, but more important, people also send long urls in plaintext
mail, and breaking them means that the url can't be pasted into a browser.
Attachment #119697 -
Flags: review?(akkana) → review-
Reporter | ||
Comment 9•22 years ago
|
||
As I mentioned before, the line breaker can handle the Japanese text case but
not necessary can break a very long Ascii word like in my original report (see
my comment #4).
Comment 10•22 years ago
|
||
Leaving long words unbroken is intentional: breaking them at the normal wrapcol
of 72 is bad, especially on urls. See, for example, bug 137253. In fact, we
did that at one time, and got bugs reported on it. I could see an argument for
having a separate limit for forcibly breaking text (e.g. at 512 bytes or
whatever mail and news servers need), but that's a separate issue, and probably
harder than just calling the line breaker.
Reporter | ||
Comment 11•22 years ago
|
||
I see, then hooking up the line breaker is the way that will take care the
Japanese text case at least.
Reporter | ||
Comment 12•22 years ago
|
||
I looked at this again and found that Japanese text do not go to the function
(AppendToStringWrapped) we have looked at so far.
There is a function nsHTMLContentSerializer::AppendText which calls
AppendToStringConvertLF or AppendToStringWrapped.
For Japanese text, AppendToStringConvertLF is called and that function does not
do line wrapping.
http://lxr.mozilla.org/seamonkey/source/content/base/src/nsHTMLContentSerializer.cpp#162
So, in addition to change AppendToStringWrapped, it is also required to change
the caller or somewhere else to call AppendToStringWrapped for long Japanese text.
This really has to be looked by someone who is familiar with that area.
Adding 'topembed' keyword.
Keywords: topembed
Comment 14•22 years ago
|
||
>Leaving long words unbroken is intentional: breaking them at the normal wrapcol
>of 72 is bad, especially on urls.
If it's intentional then is this bug a WONTFIX?
Reporter | ||
Comment 15•22 years ago
|
||
That may apply to English text but not for Japanese which usually do not contain
spaces. So the current word breaking logic in nsHTMLContentSerializer cannot
word break Japanese text.
For Japanese HTML mail, it may cause data corruption when the server forces
linebreaks at certain byte length (e.g. breaking middle by multibyte character).
Comment 16•22 years ago
|
||
Discussed in edt bug triage. Plussing. Can some one please put the correct
Japanese behavior in this bug?
Reporter | ||
Comment 17•22 years ago
|
||
>So the current word breaking logic in nsHTMLContentSerializer cannot
>word break Japanese text.
Correction: I meant for line break.
Reporter | ||
Comment 18•22 years ago
|
||
Expected behavior for Japanese text is to line break at the specified column
even without a space.
Comment 19•22 years ago
|
||
Even if nsILineBreaker says it's not a valid place to break? Or can we trust
the line breaker to tell us when it's safe to break? I hope the serializer
doesn't have to start encoding special knowledge about different charsets.
Reporter | ||
Comment 20•22 years ago
|
||
addtional info for comment #14,
by using nsILineBreaker, it finds a linebreak without a space for Japnese text
Comment 21•22 years ago
|
||
Email comment from nhotta:
I assume it is not that common to keep typing Japanese text without hitting
a return. But some user may do that (or pasting such text is also possible).
This is dataloss, should be P2 (or higher).
Priority: -- → P2
Reporter | ||
Comment 22•22 years ago
|
||
Copy the entire text and paste it into HTML mail compose and send.
Reporter | ||
Comment 23•22 years ago
|
||
Attachment #121322 -
Attachment is obsolete: true
Reporter | ||
Comment 24•22 years ago
|
||
Had a meeting (akkana, harishd, jst, kaie, nhotta),
decided to use nsILineBreaker parsing backward.
Reporter | ||
Comment 25•22 years ago
|
||
I tested by copying the attached text to HTML mail (mail charset set to UTF-8).
The problem I mentioned in comment #12 was not seen.
The function AppendToStringWrapped() was called for the Japanese text (and
prelevel was zero).
So, the Japanese text would be wrapped if AppendToStringWrapped() change to use
nsILineBreaker.
Assignee | ||
Comment 26•22 years ago
|
||
I think this bug might already be fixed on the latest trunk.
To reproduce I:
- opened attachment 121324 [details] from this bug
- select all, copy
- open new mail message
- (charset)
- paste
- send
- receive
The results I see in the arriving message depend on the version of Mozilla and
the charset that has been set for the message, before pasting.
Latest Mozilla trunk, message charset iso-8859-1:
No bug, I think.
On send, I get a warning "message contains chars not contained
in selected coding". If I decide to send anyway, the mail that arrives
consists completely of "?" questions marks.
One could say this is a bug, too, not sure.
But at least, it's not the bug reported here.
Latest Mozilla trunk, message charset set to utf-8:
Everything seems to be perfect.
The message that arrives displays all the same character.
The rendered display shows some spaces in it, they come from
linebreaks inserted on send.
When looking at the message source, the message looks correct, too.
The message is actually base64 encoded.
But when you do a manual decode (I did this using command line tools)
you see that the underlying encoding is utf-8 and each line has
at most 990 characters.
There is a trick you can see the correctness without going through
the step of manual base64 decoding. Send the above message as a
"signed" message. In this case, no base64 encoding is used, but 8bit,
and you can save the message directly, and see that 990 bytes is max
per line.
Mozilla 1.3 branch build, message charset set to utf-8:
Works identical as in previos section
Mozilla 1.3 branch build, message charset iso-8859-1:
This is the only combination that looks like the bug described here.
The received message has line breaks in the middle of あ
html entities. That produces the wrong appearance.
For what it's worth, I also tried the above with a Mozilla 1.2 based version.
I can not even start to reproduce the bug.
When I attempt to paste the japanese text into a mail message, only garbage is
shown in the editor as a result.
nhotta: Can you confirm the bug is fixed with the latest trunk?
Assignee | ||
Comment 27•22 years ago
|
||
I talked to Naoki. The latest Mozilla trunk I was using automatically sent the
message as plain text, because it contained no tags. That was most likely the
reason why it worked for me.
In order to reproduce the bug, in addition to pasting the long content, one must
use at least one HTML style, like, make the first char bold or a larger font.
I now see some problems, continuing to investigate.
Reporter | ||
Comment 28•22 years ago
|
||
Additional step for the test case in comment #22 (sorry I forgot to mention),
"make the fist character as Bold to ensure the message is sent as text/html"
also set charset to "ISO-2022-JP" makes the problem happen more often
Assignee | ||
Comment 29•22 years ago
|
||
While thinking about this bug, I was wondering, how would line breaking behave
when there is a mixture of japanese text and long URLs in the body?
I've filed bug 203122, not sure if it is valid.
Assignee | ||
Comment 30•22 years ago
|
||
I'm close to having a patch.
I'd like to describe what the consequences of the patch are.
The wrap column for HTML source is set to 72.
We already wrap tags and latin words at whitespace bondaries.
The line breaker code seems to correctly identify asian characters that can be
used as a wrap position.
That means, by using the line breaker we don't break URLs - at least we don't
break URLs that are separated by whitespace from asian characters.
However, we will introduce a new behaviour that will make the wrapped text look
ugly, I think. Naoki, I wonder if this behaviour is ok.
By inserting line breaks in the middle of japanese text, we insert significant
whitespace.
Imagine a document with 200 japanese characters, first char bold.
The wrap column is 72.
When counting the length of the line, the wrap code does not distinguish between
tags and text.
That means, the more text tags there are on a line of japanese text, the less
text characters will stay on the line after wrapping.
<b>x</b>xx-199-times-xx
This will be broken into:
<b>x</b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
However, because the bold tags are not visual, the japanese text will look like
this:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
That means, it will appear to the user that we are wrapping at random positions
in the document. If there are HTML tags in the document, the right border can be
different on each line.
But because we are inserting significant whitespace, there are more effects when
the user continues to edit the document after saving and opening.
Imagine the user inserts 10 more japanese characters in the first line (the line
with the bold tag). While editing, we do not yet break, and the document will
look like that:
xxxxxxxxxxxxxNNNNNNNNNNxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The first line is now longer than the wrap column, the other lines are still
short enough. When we save, the code will wrap the document again. The results are:
xxxxxxxxxxxxxNNNNNNNNNNxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Question: Is the above behaviour, that will be a side effect of the patch,
acceptable?
If you don't like the behaviour, it would be tricky to avoid it. I see two
things we'd have to do:
- "unwrap" when editing. That is, when loading a document, remove every single
line break between two japanese characters, if there is nothing else between the
characters than the single line break. Unfortunately, that would also erase any
line breaks the user has typed explicitly.
- no not count HTML tag characters when searching for the wrap position
Finally my questions is, what do we want?
1) Land the patch and accept that we will see lines with Japanese text of
different lengths.
2) Do 1), and file a new bug on the "unwrap" and "don't count HTML tags to
implement wrapping" suggestions. This would be done at a later time in another step.
3) All problems must be solved at once, delay checking in the wrapping fix,
until we have a solution for all the problems described.
Reporter | ||
Comment 31•22 years ago
|
||
Kai, I have a question.
Does the inconsistent line length by HTML tags affect actual display when sent
as text/html or the lines are not aligned only in message source?
>But because we are inserting significant whitespace, there are more effects when
>the user continues to edit the document after saving and opening.
do you mean whitespace as CRLF?
Assignee | ||
Comment 32•22 years ago
|
||
> Does the inconsistent line length by HTML tags affect actual display when sent
> as text/html or the lines are not aligned only in message source?
It will affect the actual display, because the wrap logic counts any chars in
the source.
>> But because we are inserting significant whitespace, there are more effects
>> when the user continues to edit the document after saving and opening.
> do you mean whitespace as CRLF?
Yes, both CRLF and SPACE in HTML source have the same visual effect when the
document is actually displayed.
If we insert CRLF to break the line, this will result in a space to appear when
the document is displayed.
Assignee | ||
Comment 33•22 years ago
|
||
To clarify my long explanation, I was unclear when talking about what's
displayed when the document is displayed.
I said "because the bold tags are not visual, the japanese text will look like
this:"
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
You might see spaces instead of linebreaks in the actual display.
I also said, after doing two save/edit cycles with the example document, the
document will look like that:
xxxxxxxxxxxxxNNNNNNNNNNxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Again, instead of line breaks, you could see a space between those blocks of
japanese characters.
Comment 34•22 years ago
|
||
Kaie, you're saying that a line break between two characters in the html source
always displays as a line break in the displayed html when displaying Japanese?
That would be odd; it's certainly not true in English/Latin text (though it
does usually show up as whitespace). Why is that different in Japanese?
Assignee | ||
Comment 35•22 years ago
|
||
> Kaie, you're saying that a line break between two characters in the html source
> always displays as a line break in the displayed html when displaying Japanese?
> That would be odd; it's certainly not true in English/Latin text (though it
> does usually show up as whitespace). Why is that different in Japanese?
I was wrong when I said "will be shown as line breaks".
What I want to say is: "The line breaks we introcuce will show up as spaces".
Although the japanese user did not enter any spaces, we will show spaces in the
document.
Reporter | ||
Comment 36•22 years ago
|
||
So actual space character is added for line break or CRLF added to the source is
shown as a space when displayed?
Assignee | ||
Comment 37•22 years ago
|
||
Oops. I have to apologize, sorry for the confusion.
The effects I saw, the shown spaces, were caused by hard spaces the preliminary
version of my line wrap patch is inserting into the document. :-(
I did not see that and made the incorrect conclusion about the behaviour of the
layout engine. Thanks for your comments, you helped me to understand it better.
Actually, the layout engine already seems to do what I suggested. When there is
a CRLF in the source between two japanese characters, nothing is rendered in the
display. So all is well.
I will come up with a patch shortly.
Reporter | ||
Comment 38•22 years ago
|
||
>Actually, the layout engine already seems to do what I suggested.
I think the change (not to show CRLF as a space for Japanese text) is done by
after NS7.0. The current trunk does not show spaces. So we might need some extra
work to support 7.0 or earlier versions.
Assignee | ||
Comment 39•22 years ago
|
||
I believe this patch is correct.
This is a rewrite of the AppendStringWrapped method. I began to tweak the
original method but the existing code looked to confusing to me, so I decided
I'm better off by rewriting it.
There are two basic ideas used in the patch
- as it has been suggested, when searching for a wrap position, use the line
breaker helper class, and only fall back to a simpler approach if it doesn't
help
- never translate a sequence of end-of-line characters (with no spaces or tabs)
into a space. This protects us from visible space showing up. Inserting
linebreaks is "better" than space, because the layout engine obviously is smart
enough to not show linebreaks as whitespace between asian characters.
I also added a lot of comments to the new code, in the hope the new code will
be more easily maintainable.
I have also added the tab character as a valid whitespace character. The
existing code only checked for ' ' and '\n'. Checking for '\t' in addition is
correct, right?
Attachment #119697 -
Attachment is obsolete: true
Assignee | ||
Comment 40•22 years ago
|
||
I tested with simple and long documents, long and short latin1 words and URLS,
and mixed documents, containing both asian and latin characters.
I also tested a large web document from http://www.spiegel.de
One thing I noticed is a slightly new behaviour, with the rewrite, I think it's
now better than before:
When the original document has a sequence of
<!-- x -->
<!-- x2 -->
the existing serializer transforms that into
<!-- x --><!-- x2 -->
The new code no longer does that.
Assignee | ||
Comment 41•22 years ago
|
||
Any help in reviewing this patch mostly appreciated!
Assignee | ||
Comment 42•22 years ago
|
||
Comment on attachment 121713 [details] [diff] [review]
Patch v2
Akkana offered to have a look at the patch. Thanks a lot.
Burpmaster, do you have time to look at the code, too? Akkana said, you have a
good knowledge of the wrapping code.
Attachment #121713 -
Flags: review?(akkana)
Comment 43•22 years ago
|
||
ftang can you take a look too?
Assignee | ||
Comment 44•22 years ago
|
||
I reread my patch, I want to make the following minor changes to the patch:
- mColPos += pos - sequenceStart;
is adding too much, mColPos is already at mMaxColumn,
fix this by removing that line
and adding a ++mColPos next to ++pos in the loop.
- all do{switch{}}while(); loops can optimized by moving the ++ statements
into the default: part of the switch block.
Assignee | ||
Updated•22 years ago
|
Attachment #121713 -
Attachment is obsolete: true
Attachment #121713 -
Flags: review?(akkana)
Assignee | ||
Comment 45•22 years ago
|
||
Slightly updated patch, as described in previous comment.
Assignee | ||
Updated•22 years ago
|
Attachment #121765 -
Flags: review?(akkana)
Reporter | ||
Comment 46•22 years ago
|
||
The displaying space problem I mentioned in comment #38, it happens with 7.0.
If I reply quote the HTML as plain text then actual spaces are inserted for the
original CRLF in the HTML source.
The current trunk does not have the problem.
I think we need to do something about it to support NS 7.0 users.
Assignee | ||
Comment 47•22 years ago
|
||
I don't have a good idea how to make our messages appear without spaces in the
display in Mozilla 1.0 based products.
Instead of wrapping with a line break char in the HTML source, we could also
write a <br> tag. This would cause Mozilla 1.0 to do a hard break whereever it
currently displays a space. This would make the spaces in the display go away,
but the lines will no longer be aligned at the right side.
But I think we don't want to insert <br>s into the source.
By doing so, we would change the content.
Reporter | ||
Comment 48•22 years ago
|
||
>But I think we don't want to insert <br>s into the source.
Right, CRLF is needed in order to avoid smtp server to force line break
(that is 1000 bytes according to RFC 2821 4.5.3.1).
I do not have a solution for the problem either.
A possible workaround is to make line break to occur less frequently.
E.g. if the break point is not a space or CRLF then search forward for another
break point until hit a certain limit (e.g. 256 character).
Assignee | ||
Comment 49•22 years ago
|
||
> A possible workaround is to make line break to occur less frequently.
> E.g. if the break point is not a space or CRLF then search forward for another
> break point until hit a certain limit (e.g. 256 character).
I fear we can't do that.
A japanese characters is stored as a byte sequence of multiple chars in HTML,
the char in the attached testcase is stored as あ
With the current default of wrapping at 72 characters, if a line contains only
japanese characters, we already have lines that are 576 bytes long!
If we want to stay below the max allowed length for mail messages, we could at
most go to 123 allowed characters per line.
I personally think, going from 72 to 123 does not improve our situation a lot.
And isn't our situation already better? If I understand correctly, one japanese
characters take up the same space as two latin characters. We still wrap at 72
characters, which is double the visual width as with latin text. That is, while
we wrap at the width of 72 latin characters for latin text, we wrap at the width
of 144 latin characters for japanese text.
I'd prefer if we accept that users using Mozilla 1.0 based software will see
some spaces, that we describe the behaviour in Mozilla 1.0 is a bug, and
recommend users to upgrade.
Assignee | ||
Comment 50•22 years ago
|
||
Note, the 1.2 branch behaves good, and does not show the spaces!
Reporter | ||
Comment 51•22 years ago
|
||
>A japanese characters is stored as a byte sequence of multiple chars in HTML,
>the char in the attached testcase is stored as あ
That happens when the charset cannot encode the Unicode code point.
For the attached HTML, the charset is UTF-8 and a character is encoded as three
bytes as 0xE38182 in that test case. So usually one character is encoded less
than 8 bytes as あ but it is true that such case can happen.
But I see your point about space seen per 144 column. If that is acceptable or
not that is not easy to tell. Without the fix we see data corruption when the
line exceeds 1000 bytes, with the fix 7.0 users see spaces every 72 characters.
Reporter | ||
Comment 52•22 years ago
|
||
The other fact is that HTML composed mail with text only is sent as text/plain
and the problem does not happen.
So, I think the new problem happens more often than the data corruption but
still does not happen so frequently. And as Kai mentioned the user has an option
to upgrade.
Assignee | ||
Comment 53•22 years ago
|
||
I propose we land this patch.
The patch is an improvement, and we don't have a perfect idea currently, only
ideas how to slightly raise the limit how soon the new problem will be seen.
The new problem is only of cosmetic nature (spaces). All data will be readable
by the recipient.
I think we should not care too much about Mozilla 1.0 in this case, because
Mozilla 1.0 is broken anyway, when typing long lines. We can't fix that in 1.0.
If we have a better idea at a later time, we can still implement it at a later time.
Naoki, do you agree to take this patch as a first step to improve the situation?
Reporter | ||
Comment 54•22 years ago
|
||
Kai, I agree with you going forward with the current patch. We need to fix the
data corruption.
But I have to mention that the cosmetic problem (spaces) is not a minor issue.
Assignee | ||
Updated•22 years ago
|
Attachment #121765 -
Flags: review?(akkana) → review?(jst)
Comment 55•22 years ago
|
||
I agree too. We need to fix the bug. It'd be nice if we had a solution that
would not reveal the Moz 1.0 bug in msgs sent from Moz 1.4, but we don't.
It's also not clear how often we will see an uninterrupted sequence of more
than 72 kanji/kana in HTML mail messages.
> I personally think, going from 72 to 123 does not improve our situation a lot.
I don't know how much it improves, but it would be an improvement. As you
note the column width is double for these characters, so it would be changing
if from 144 to 246 columns. So why not take it, even if it is a small
improvement?
Assignee | ||
Comment 56•22 years ago
|
||
> I don't know how much it improves, but it would be an improvement. As you
> note the column width is double for these characters, so it would be changing
> if from 144 to 246 columns. So why not take it, even if it is a small
> improvement?
The point is, there is not a single place in the source that we could change
from 72 to 123, but there are multiple places in the code that specify 72 or
other numbers.
The API to the content serializer takes the requested width as an argument, and
various code all over Mozilla makes use of that parameter and passes in the
desired width. The value 72 only happens to be the default in the case we are
looking at.
If we say, we want 123 as the desired number, this would effectively mean, we
remove the flexibility from the API, and have to hardcode the number into the
serializer - because we don't want smaller numbers, and we can't allow larger
numbers.
I don't know the existing good code good enough to judge, whether we would
introduce regressions by removing the parameter from the API and fixing the
value to 123.
Assignee | ||
Comment 57•22 years ago
|
||
My conclusion from my previous comment: I think the few % likelyhood reduction
we is not worth the trouble. Hardcoding the value is not an option, I think.
IMHO, the only reasonable approach to increase the number is: Investigate and
find all callers who call the function and pass in the wrap column as a
parameter, and change the calling code to possibly use larger numbers.
Comment 58•22 years ago
|
||
Kaie is right that hardcoding is not a good option. In mozilla mail, the wrap
column should be following the preference "mailnews.wraplength". Any "72"
that's compiled into the source is be purely a fallback for embedding apps which
don't control wrap length; and any embedding app that cares about such things
should probably have a central place to control it (e.g. a pref like mozilla
uses, or a single point from which the serializer is called after checking
things lke wrap column and html-vs-plaintext).
Comment 59•22 years ago
|
||
Comment on attachment 121765 [details] [diff] [review]
Patch v2b
+ nsresult rv;
+ nsCOMPtr<nsILineBreakerFactory> lf(do_GetService(kLWBrkCID, &rv));
+ if (NS_SUCCEEDED(rv)) {
+ nsAutoString lbarg;
+ rv = lf->GetBreaker(lbarg, getter_AddRefs(mLineBreaker));
+ if (NS_FAILED(rv)) return NS_ERROR_FAILURE;
+ }
If a document is reachable here, you should ask it for the line breaker, if
there is no document, or it has no line breaker, then get the service and ask
it. And in stead of creating an empty nsAutoString object to pass to
GetBreaker(), simply pass nsString() in stead, less code, and faster code.
- In nsHTMLContentSerializer::AppendToStringWrapped():
The new logic here looks good to me, but it's non-trivial, so there's really no
way to know for sure that this works w/o doing a lot of testing on a lot of
data. As for this method, it suffers from a very common problem we see all over
Mozilla code, it's way too long, and that makes it really hard to read. Could
it be split up into two or more methods? Say one that handles a whole sequence
of whitespace, and one that handles non-whitespace. That way you have one
simple loop that calls either of the two above functions, depending on if we're
dealing with whitespace or not.
I have a few style-nits with the code, see below:
+ do {
...
+ }
+ while (...);
Personally, I don't like that style at all, I'd want the while to be on the
same line as the closing brace, i.e.:
+ do {
...
+ } while (...);
That way it's obvious that the while belongs to the above do. There's a few of
those in this code.
+ do {
+ switch (*pos) {
+ case ' ':
+ case '\t':
+ case '\n':
+ foundWhitespaceInLoop = PR_TRUE;
+ break;
+ default:
+ ++pos;
+ ++mColPos;
+ break;
+ }
+ }
+ while (!foundWhitespaceInLoop &&
+ pos < end &&
+ mColPos < mMaxColumn);
Same here, and make the last two lines of the while condition line up with the
opening paren. And if you'd make the switch in that loop an if, you could
simply break out of the loop when you find some whitespace and avoid checking a
boolean, in addition to the other checks, every iteration.
+ result = mLineBreaker->Prev(sequenceStart,
+ (end - sequenceStart),
+ (pos - sequenceStart)+1,
+ &wrapPosition,
+ &oNeedMoreText);
+ if (NS_SUCCEEDED(result) && !oNeedMoreText) {
+ foundWrapPosition = PR_TRUE;
+ }
+ else {
+ result = mLineBreaker->Next(sequenceStart,
+ (end - sequenceStart),
+ (pos - sequenceStart),
+ &wrapPosition,
+ &oNeedMoreText);
+ if (NS_SUCCEEDED(result) && !oNeedMoreText) {
+ foundWrapPosition = PR_TRUE;
+ }
+ }
What if oNeedMoreText? Shouldn't happen, right? Maybe assert that it doesn't?
+ do {
+ switch (*pos) {
+ case ' ':
+ case '\t':
+ case '\n':
+ foundWhitespaceInLoop = PR_TRUE;
+ break;
+ default:
+ ++pos;
+ ++mColPos;
+ break;
+ }
+ }
+ while (!foundWhitespaceInLoop && pos < end);
Same thing here, move the while, and you could avoid the need to check
!foundWhitespaceInLoop if you change the switch to an if and just break out of
the loop when you hit whitespace.
Other than that, this looks really good.
Attachment #121765 -
Flags: review?(jst) → review-
Comment 60•22 years ago
|
||
Comment on attachment 121765 [details] [diff] [review]
Patch v2b
I have a comment of my own.
>+ if (pos == end) {
>+ return;
>+ }
This is no longer needed because the while loop won't be entered in this case.
>+ while (pos < end) {
Assignee | ||
Comment 61•22 years ago
|
||
> If a document is reachable here, you should ask it for the line breaker, if
> there is no document, or it has no line breaker, then get the service and ask
> it.
The only way I found to reach a document was
from nsIDOMText to nsIContent nsIDocument.
Changed.
> And in stead of creating an empty nsAutoString object to pass to
> GetBreaker(), simply pass nsString() in stead, less code, and faster code.
done
> The new logic here looks good to me, but it's non-trivial, so there's really
no
> way to know for sure that this works w/o doing a lot of testing on a lot of
> data.
true
> As for this method, it suffers from a very common problem we see all over
> Mozilla code, it's way too long, and that makes it really hard to read. Could
> it be split up into two or more methods? Say one that handles a whole
sequence
> of whitespace, and one that handles non-whitespace.
done, split into two methods
> + }
> + while (...);
> change to
> + } while (...);
done
> + while (!foundWhitespaceInLoop &&
> + pos < end &&
> + mColPos < mMaxColumn);
> And if you'd make the switch in that loop an if, you could
> simply break out of the loop when you find some whitespace and avoid checking
a
> boolean, in addition to the other checks, every iteration.
done, both places
> What if oNeedMoreText? Shouldn't happen, right? Maybe assert that it doesn't?
In my tests, it did happen!.
But no need for an assertion.
If we don't set foundWrapPosition, we will fall back to the trivial break
logic.
> + if (pos == end) {
> + return;
> + }
> This is no longer needed because the while loop won't be entered in this
case.
Removed
Attachment #121765 -
Attachment is obsolete: true
Assignee | ||
Updated•22 years ago
|
Attachment #122290 -
Flags: review?(jst)
Comment 62•22 years ago
|
||
Comment on attachment 122290 [details] [diff] [review]
Patch v3
+ nsCOMPtr<nsIContent> content = do_QueryInterface(aText);
+ if (content) {
+ nsCOMPtr<nsIDocument> document;
+ content->GetDocument(*getter_AddRefs(document));
+ if (document) {
+ document->GetLineBreaker(getter_AddRefs(mLineBreaker));
+ }
+ }
Using GetOwnerDocument() on aText would be a bit more reliable, i.e.:
+ nsCOMPtr<nsIDOMDocument> domDoc
+ aText->GetOwnerDocument(getter_AddRefs(domDoc));
+ nsCOMPtr<nsIDocument> document = do_QueryInterface(domDoc);
+ if (document) {
+ document->GetLineBreaker(getter_AddRefs(mLineBreaker));
+ }
Further down:
+ do {
...
+ } while (pos < end &&
+ mColPos < mMaxColumn);
Isn't the latter condition more common? If so, make it the first condition...
+ nsresult result;
Nit: Wanna use "rv" here (and other places too where you change related code)
in stead of "result" (like we do in most places) to make lines a bit shorter?
r=jst
Attachment #122290 -
Flags: review?(jst) → review+
Assignee | ||
Comment 63•22 years ago
|
||
Patch having the requested changes.
Attachment #122290 -
Attachment is obsolete: true
Assignee | ||
Comment 64•22 years ago
|
||
Comment on attachment 122481 [details] [diff] [review]
Patch v4
carrying forward r=jst
Peter, could you please sr= ?
Attachment #122481 -
Flags: superreview?(peterv)
Attachment #122481 -
Flags: review+
Comment 65•22 years ago
|
||
Comment on attachment 122481 [details] [diff] [review]
Patch v4
Forgot to mention one nit:
+ do {
+ PRUnichar c = *pos;
+ if (' ' == c || '\t' == c || '\n' == c) {
+ foundWhitespaceInLoop = PR_TRUE;
+ break;
+ }
+ else {
+ ++pos;
+ ++mColPos;
+ }
+ } while (mColPos < mMaxColumn &&
+ pos < end);
No need for else after break there, i.e.:
+ do {
+ PRUnichar c = *pos;
+ if (' ' == c || '\t' == c || '\n' == c) {
+ foundWhitespaceInLoop = PR_TRUE;
+ break;
+ }
+
+ ++pos;
+ ++mColPos;
+ } while (mColPos < mMaxColumn &&
+ pos < end);
And same thing here:
+ do {
+ PRUnichar c = *pos;
+ if (' ' == c || '\t' == c || '\n' == c) {
+ break;
+ }
+ else {
+ ++pos;
+ ++mColPos;
+ }
+ } while (pos < end);
My review still applies.
Assignee | ||
Comment 66•22 years ago
|
||
My newest local patch file has the requested nit corrected.
Comment 67•22 years ago
|
||
Kai, wil this be targetted to land on the 1.5 train? Can you set the
appropriate target milestone?
Assignee | ||
Updated•22 years ago
|
Target Milestone: --- → mozilla1.4beta
Comment 68•22 years ago
|
||
Comment on attachment 122481 [details] [diff] [review]
Patch v4
>Index: mozilla/content/base/src/nsHTMLContentSerializer.cpp
>===================================================================
>@@ -82,8 +87,10 @@
> mColPos = 0;
> mIndent = 0;
> mAddSpace = PR_FALSE;
>+ mMayIgnoreLineBreakSequence = PR_FALSE;
> mInBody = PR_FALSE;
> mInCDATA = PR_FALSE;
>+ mNeedLineBreaker = PR_TRUE;
Please make these member initializers.
>@@ -188,6 +217,244 @@
> return NS_OK;
> }
>
>+void nsHTMLContentSerializer::AppendWrapped_WhitespaceSequence(
>+ nsASingleFragmentString::const_char_iterator &pos,
>+ const nsASingleFragmentString::const_char_iterator end,
>+ const nsASingleFragmentString::const_char_iterator sequenceStart,
>+ PRBool &mayIgnoreStartOfLineWhitespaceSequence,
>+ nsAString& aOutputStr)
aPos, aEnd, ...
Please be consistent, move the &'s next to the type.
>+{
>+ // Handle the complete sequence of whitespace.
>+ // Continue to iterate until we find the first non-whitespace char.
>+ // Updates "pos" to point to the first unhandled char.
>+ // Also updates the mayIgnoreStartOfLineWhitespaceSequence flag,
>+ // as well as the other "global" state flags.
>+
>+ PRBool sawBlankOrTab = PR_FALSE;
>+ PRBool leaveLoop = PR_FALSE;
>+
>+ do {
>+ switch (*pos) {
>+ case ' ':
>+ case '\t':
>+ sawBlankOrTab = PR_TRUE;
>+ // no break
>+ case '\n':
>+ ++pos;
>+ // do not increase mColPos,
Trailing whitespace.
>+ // because we will reduce the whitespace to a single char
>+ break;
>+ default:
>+ leaveLoop = PR_TRUE;
>+ break;
>+ }
>+ } while (!leaveLoop && pos < end);
>+
>+ if (mAddSpace) {
>+ // if we had previously been asked to add space, our situation has not changed
Long line.
>+ }
>+ else if (!sawBlankOrTab && mMayIgnoreLineBreakSequence) {
>+ // nothing to do
>+ mMayIgnoreLineBreakSequence = PR_FALSE;
>+ }
>+ else if (mayIgnoreStartOfLineWhitespaceSequence) {
Trailing whitespace.
>+ // nothing to do
>+ mayIgnoreStartOfLineWhitespaceSequence = PR_FALSE;
>+ }
>+ else {
>+ if (sawBlankOrTab) {
>+ if (mColPos+1 >= mMaxColumn) {
Spaces around +.
>+ // no much sense in delaying, we only have one slot left,
Trailing whitespace.
>+void nsHTMLContentSerializer::AppendWrapped_NonWhitespaceSequence(
>+ nsASingleFragmentString::const_char_iterator &pos,
>+ const nsASingleFragmentString::const_char_iterator end,
>+ const nsASingleFragmentString::const_char_iterator sequenceStart,
>+ PRBool &mayIgnoreStartOfLineWhitespaceSequence,
>+ nsAString& aOutputStr)
>+ do {
>+ onceAgainBecauseWeAddedBreakInFront = PR_FALSE;
>+ foundWhitespaceInLoop = PR_FALSE;
>+
>+ do {
>+ PRUnichar c = *pos;
>+ if (' ' == c || '\t' == c || '\n' == c) {
Nit: the old way (if (*pos == ' ' || ...)) looks nicer to me.
>+ foundWhitespaceInLoop = PR_TRUE;
>+ break;
>+ }
>+ else {
>+ ++pos;
>+ ++mColPos;
>+ }
>+ } while (mColPos < mMaxColumn &&
>+ pos < end);
No need to put the conditions on two lines.
>+
>+ if (pos == end || foundWhitespaceInLoop) {
>+ // there is enough room for the complete block we found
>+
>+ if (mAddSpace) {
>+ aOutputStr.Append(PRUnichar(' '));
>+ mAddSpace = PR_FALSE;
>+ }
>+
>+ aOutputStr.Append(sequenceStart, pos - sequenceStart);
>+ // We have not yet reached the max column, we will continue to
>+ // fill the current line in the next outer loop iteration.
>+ }
>+ else { // mColPos == mMaxColumn
>+
Remove this new line.
>+ if (!thisSequenceStartsAtBeginningOfLine && mAddSpace) {
>+ // We can avoid to wrap.
>+
>+ aOutputStr.Append(mLineBreak);
>+ mAddSpace = PR_FALSE;
>+ pos = sequenceStart;
>+ mColPos = 0;
>+ thisSequenceStartsAtBeginningOfLine = PR_TRUE;
>+ onceAgainBecauseWeAddedBreakInFront = PR_TRUE;
>+ }
>+ else
>+ {
Move this accolade up one line.
>+ // we must wrap
>+
>+ PRBool foundWrapPosition = PR_FALSE;
>+
>+ if (mLineBreaker) { // we have a line breaker helper object
>+ PRUint32 wrapPosition;
>+ PRBool oNeedMoreText;
needMoreText?
>+ nsresult rv;
>+
>+ rv = mLineBreaker->Prev(sequenceStart,
>+ (end - sequenceStart),
>+ (pos - sequenceStart)+1,
Spaces around the +.
>+ if (!mLineBreaker || !foundWrapPosition) {
>+ // try some simple fallback logic
>+ // go forward up to the next whitespace position,
>+ // in the worst case this will be all the rest of the data
>+
>+ do {
>+ PRUnichar c = *pos;
>+ if (' ' == c || '\t' == c || '\n' == c) {
See nit above.
>+ break;
>+ }
>+ else {
>+ ++pos;
>+ ++mColPos;
>+ }
>+ } while (pos < end);
>+
>+ if (mAddSpace) {
>+ aOutputStr.Append(PRUnichar(' '));
>+ mAddSpace = PR_FALSE;
>+ }
>+
>+ aOutputStr.Append(sequenceStart, pos - sequenceStart);
>+ }
>+ }
>+ }
>+ }
>+ while (onceAgainBecauseWeAddedBreakInFront);
Be consistent, move the while behind the accolade.
>+}
>+
>+void
>+nsHTMLContentSerializer::AppendToStringWrapped(const nsASingleFragmentString& aStr,
>+ nsAString& aOutputStr,
>+ PRBool aTranslateEntities)
>+{
>+ // if beginning of a whitespace sequence
>+ if (' ' == c || '\n' == c || '\t' == c) {
See nit above.
>Index: mozilla/content/base/src/nsHTMLContentSerializer.h
>===================================================================
>@@ -139,6 +153,9 @@
> PRInt32 mMaxColumn;
>
> nsString mLineBreak;
>+
>+ PRBool mNeedLineBreaker;
You could make this a PRPackedBool and move it next to mInCDATA.
Assignee | ||
Comment 69•22 years ago
|
||
Changed everything, except the
' ' == *aPos
notation.
I prefer it, because it protects me from doing an accidential assignment.
Attachment #122481 -
Attachment is obsolete: true
Assignee | ||
Updated•22 years ago
|
Attachment #122481 -
Flags: superreview?(peterv)
Assignee | ||
Updated•22 years ago
|
Attachment #123398 -
Flags: superreview?(peterv)
Attachment #123398 -
Flags: review+
Comment 70•22 years ago
|
||
> I prefer it, because it protects me from doing an accidential assignment.
That's what testing and reviewers are for. content code (and its owners/peers)
generally prefers the variable on the left hand side of a comparison. And more
to the point, we hate seeing style clashes in our files. People (such as jst)
run through the content/ code randomly to clean this sort of stuff up...
Assignee | ||
Comment 71•22 years ago
|
||
> That's what testing and reviewers are for.
Why didn't the review process then find the incorrect code in my patch v2?
+ if (foundWrapPosition && 0 == wrapPosition) {
+ foundWrapPosition == PR_FALSE;
+ }
In the end it's the coder who is responsible for code correctness.
The compiler found that for us.
> content code (and its owners/peers)
> generally prefers the variable on the left hand side of a comparison.
> And more to the point, we hate seeing style clashes in our files.
I don't get the point how safe code can be assumed as being incorrect style.
> People (such as jst)
> run through the content/ code randomly to clean this sort of stuff up...
I already have r=jst ;-)
Comment 72•22 years ago
|
||
Comment on attachment 123398 [details] [diff] [review]
Patch v5
>Index: mozilla/content/base/src/nsHTMLContentSerializer.cpp
>===================================================================
>@@ -78,12 +83,14 @@
> }
>
> nsHTMLContentSerializer::nsHTMLContentSerializer()
>+:mIndent(0)
>+,mColPos(0)
>+,mInBody(PR_FALSE)
>+,mAddSpace(PR_FALSE)
>+,mMayIgnoreLineBreakSequence(PR_FALSE)
>+,mInCDATA(PR_FALSE)
>+,mNeedLineBreaker(PR_TRUE)
Operators need to be add end of line and add some spaces:
: mIndent(0),
mColPos(0),
etc,
>+ do {
>+ PRUnichar c = *aPos;
>+ if (' ' == c || '\t' == c || '\n' == c) {
I still don't understand why you use that local var though (here and further
below).
Attachment #123398 -
Flags: superreview?(peterv) → superreview+
Assignee | ||
Comment 73•22 years ago
|
||
Attachment #123398 -
Attachment is obsolete: true
Assignee | ||
Updated•22 years ago
|
Attachment #123500 -
Flags: superreview+
Attachment #123500 -
Flags: review+
Attachment #123500 -
Flags: approval1.4?
Assignee | ||
Updated•22 years ago
|
Attachment #123500 -
Flags: superreview+
Attachment #123500 -
Flags: review+
Attachment #123500 -
Flags: approval1.4?
Assignee | ||
Comment 74•22 years ago
|
||
Addressed all comments.
Attachment #123500 -
Attachment is obsolete: true
Assignee | ||
Updated•22 years ago
|
Attachment #123504 -
Flags: superreview+
Attachment #123504 -
Flags: review+
Attachment #123504 -
Flags: approval1.4?
Comment 75•22 years ago
|
||
> I don't get the point how safe code can be assumed as being incorrect style.
It can be hard to read. Furthermore, gcc warns about unparenthesized assignment
in if, while, etc., conditionals.
Comment 76•22 years ago
|
||
peterv, re: comment 72: I use local vars to "common" dereferenced pointers, even
if a decent compiler can do it for me, just for brevity's sake, and in case
memory ambiguity (possibility of aliasing, or side effects in functions called)
would make for a needless reload (not the case here, but the pattern's the thing).
/be
Comment 77•22 years ago
|
||
Comment on attachment 123504 [details] [diff] [review]
Patch v7
This looks like the right thing to do so we're approving it for 1.4. But it
also looks large so please keep a close eye on nightly build and 1.4 RC1
feedback for any regressions.
Attachment #123504 -
Flags: approval1.4? → approval1.4+
Comment 78•22 years ago
|
||
Brendan: I stand corrected, I'll keep that nit to myself in the future :-). I
find it confusing because the value of the dereferenced pointer isn't changing
but with the local var there's two values to keep track of.
Comment 80•22 years ago
|
||
Did this land yet? I don't see anything in bonsai.
Whiteboard: edt_x3 → edt_x3 landed1.4?
Assignee | ||
Comment 81•22 years ago
|
||
I'll check it in today.
Assignee | ||
Comment 82•22 years ago
|
||
fixed on trunk
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Whiteboard: edt_x3 landed1.4? → edt_x3
You need to log in
before you can comment on or make changes to this bug.
Description
•