Closed Bug 200582 Opened 22 years ago Closed 22 years ago

No line wrap for text without spaces in nsHTMLContentSerializer.cpp

Categories

(Core :: DOM: Editor, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla1.4beta

People

(Reporter: nhottanscp, Assigned: KaiE)

References

Details

(Keywords: embed, topembed+, Whiteboard: edt_x3)

Attachments

(2 files, 8 obsolete files)

In Mozilla mail (HTML), sending out long lines which do not include spaces (e.g. Japanese text) do not wrap. E.g. the next line has 100 characters and it does not wrap. 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
http://lxr.mozilla.org/seamonkey/source/content/base/src/nsHTMLContentSerializer.cpp#740 The code is trying to find a white space for a line break. Note that mMaxColumn is not checked here. 740 // if we're past the wrapping width with no place to wrap at, 741 // find the next whitespace and wrap there 742 while (pos != end && *pos != ' ' && *pos != '\n') { 743 ++pos; 744 } 745 mAddSpace is false so going to else 746 if (mAddSpace) { 747 // whitespace was needed before the next segment, so we can put 748 // a newline instead of a space, and avoid getting a lone line 749 aOutputStr.Append(mLineBreak); 750 addLineBreak = PR_FALSE; 751 752 mColPos = pos - segStart; 753 754 // if the string doesn't end in whitespace, set mAddSpace to false 755 if (pos == end) { 756 mAddSpace = PR_FALSE; 757 } 758 } Here is set addLineBreak to true. But at that point the position reaches at end, so no linebreak is added before returning at line 766. But the loop at line 742 does not do a column check, the line could be very long even if we change to put one linebreak before return. 759 else { 760 // no choice but to write a long line and wrap immediately after it 761 addLineBreak = PR_TRUE; 762 } 763 aOutputStr.Append(segStart, pos - segStart); 764 765 if (pos == end) { 766 return; 767 }
At the loop at line 742, nsILineBreaker may be used to get a linebreak for text without spaces like Japanese text, that is already used in nsPlainTextSerializer. It can also check like (mMaxColumn * 2) to prevent the line to be very long.
Keywords: embed
See also bug 56921, which covers a bug in this code which makes ascii lines too long.
The patch is to prevent very long lines which may cause data corruption for encoded Japanese text (e.g. ISO-2022-JP) when the server forces line breaks. I think the integration of the line breaker for better formatting can be done separately by someone more familiar with the code.
Attachment #119697 - Flags: review?(akkana)
I'm not sure I'm understanding this correctly. Why doesn't the while (mColPos < mMaxColumn) take care of this -- why isn't the new check redundant with that one? By the time we get to the place where the patch adds the new check, isn't mColPos >= mMaxColumn already, without seeing a space? So the change made by the patch is that if there wasn't a space in the line, the line will be forcibly broken in the middle of a word, whether or not a space was seen? Am I understanding that correctly, and if so, is that really a change we want to make? Will it cause a line of 79 ascii dashes to be broken into long-short pairs (with the default wrapcol of 72)? I don't have a tree that builds right now, and we're having unexplained network slowness today so it may be a while before I can update and test it. I wish we could just do wrapping right, using nsILineBreaker like in the plaintext serializer and searching backwards as described in bug 56921, instead of applying more band-aids on top of the existing band-aid layers. Burpmaster, do you have time to glance at the patch and see if I'm misinterpreting? You're probably the most familiar with this code.
>the patch is that if there wasn't a space in the line, the line will be forcibly >broken in the middle of a word, whether or not a space was seen? Am I Yes, for Japanese text that is okay (and this is for HTML source not affect actual display), for English it could be a problem but it only happens if the word is very long like greater than 72, so I assume practically okay. >I wish we could just do wrapping right, using nsILineBreaker like in the >plaintext serializer and searching backwards as described in bug 56921, instead >of applying more band-aids on top of the existing band-aid layers. I agree if the right implementation is available soon then the current patch would not be needed. If the wrapping has to happen within the column then the search has to go backwards. And the line breaker will take care Japanese text but a long word without a space like I mentioned in the original report would not be broken even with the line breaker.
nsbeta1 The data corruption I mentioned in comment #4 is generic and can happen with Mozilla mail.
Keywords: nsbeta1
Comment on attachment 119697 [details] [diff] [review] When looping for white space, also check for max line. Sorry, but no, I can't approve breaking wrapping latin text just because nobody wants to spend the time writing code that calls nsILineBreaker. Lots of people send plaintext mail with long separator lines of "*" or "=" or whatever, as well as ascii art, but more important, people also send long urls in plaintext mail, and breaking them means that the url can't be pasted into a browser.
Attachment #119697 - Flags: review?(akkana) → review-
As I mentioned before, the line breaker can handle the Japanese text case but not necessary can break a very long Ascii word like in my original report (see my comment #4).
Leaving long words unbroken is intentional: breaking them at the normal wrapcol of 72 is bad, especially on urls. See, for example, bug 137253. In fact, we did that at one time, and got bugs reported on it. I could see an argument for having a separate limit for forcibly breaking text (e.g. at 512 bytes or whatever mail and news servers need), but that's a separate issue, and probably harder than just calling the line breaker.
I see, then hooking up the line breaker is the way that will take care the Japanese text case at least.
I looked at this again and found that Japanese text do not go to the function (AppendToStringWrapped) we have looked at so far. There is a function nsHTMLContentSerializer::AppendText which calls AppendToStringConvertLF or AppendToStringWrapped. For Japanese text, AppendToStringConvertLF is called and that function does not do line wrapping. http://lxr.mozilla.org/seamonkey/source/content/base/src/nsHTMLContentSerializer.cpp#162 So, in addition to change AppendToStringWrapped, it is also required to change the caller or somewhere else to call AppendToStringWrapped for long Japanese text. This really has to be looked by someone who is familiar with that area. Adding 'topembed' keyword.
Keywords: topembed
-> me
Assignee: jfrancis → kaie
>Leaving long words unbroken is intentional: breaking them at the normal wrapcol >of 72 is bad, especially on urls. If it's intentional then is this bug a WONTFIX?
That may apply to English text but not for Japanese which usually do not contain spaces. So the current word breaking logic in nsHTMLContentSerializer cannot word break Japanese text. For Japanese HTML mail, it may cause data corruption when the server forces linebreaks at certain byte length (e.g. breaking middle by multibyte character).
Discussed in edt bug triage. Plussing. Can some one please put the correct Japanese behavior in this bug?
Keywords: topembedtopembed+
Whiteboard: edt_x3
>So the current word breaking logic in nsHTMLContentSerializer cannot >word break Japanese text. Correction: I meant for line break.
Expected behavior for Japanese text is to line break at the specified column even without a space.
Even if nsILineBreaker says it's not a valid place to break? Or can we trust the line breaker to tell us when it's safe to break? I hope the serializer doesn't have to start encoding special knowledge about different charsets.
addtional info for comment #14, by using nsILineBreaker, it finds a linebreak without a space for Japnese text
Email comment from nhotta: I assume it is not that common to keep typing Japanese text without hitting a return. But some user may do that (or pasting such text is also possible). This is dataloss, should be P2 (or higher).
Priority: -- → P2
Attached patch Long Japanese text for testing. (obsolete) (deleted) — Splinter Review
Copy the entire text and paste it into HTML mail compose and send.
Attachment #121322 - Attachment is obsolete: true
Had a meeting (akkana, harishd, jst, kaie, nhotta), decided to use nsILineBreaker parsing backward.
I tested by copying the attached text to HTML mail (mail charset set to UTF-8). The problem I mentioned in comment #12 was not seen. The function AppendToStringWrapped() was called for the Japanese text (and prelevel was zero). So, the Japanese text would be wrapped if AppendToStringWrapped() change to use nsILineBreaker.
I think this bug might already be fixed on the latest trunk. To reproduce I: - opened attachment 121324 [details] from this bug - select all, copy - open new mail message - (charset) - paste - send - receive The results I see in the arriving message depend on the version of Mozilla and the charset that has been set for the message, before pasting. Latest Mozilla trunk, message charset iso-8859-1: No bug, I think. On send, I get a warning "message contains chars not contained in selected coding". If I decide to send anyway, the mail that arrives consists completely of "?" questions marks. One could say this is a bug, too, not sure. But at least, it's not the bug reported here. Latest Mozilla trunk, message charset set to utf-8: Everything seems to be perfect. The message that arrives displays all the same character. The rendered display shows some spaces in it, they come from linebreaks inserted on send. When looking at the message source, the message looks correct, too. The message is actually base64 encoded. But when you do a manual decode (I did this using command line tools) you see that the underlying encoding is utf-8 and each line has at most 990 characters. There is a trick you can see the correctness without going through the step of manual base64 decoding. Send the above message as a "signed" message. In this case, no base64 encoding is used, but 8bit, and you can save the message directly, and see that 990 bytes is max per line. Mozilla 1.3 branch build, message charset set to utf-8: Works identical as in previos section Mozilla 1.3 branch build, message charset iso-8859-1: This is the only combination that looks like the bug described here. The received message has line breaks in the middle of &#12354; html entities. That produces the wrong appearance. For what it's worth, I also tried the above with a Mozilla 1.2 based version. I can not even start to reproduce the bug. When I attempt to paste the japanese text into a mail message, only garbage is shown in the editor as a result. nhotta: Can you confirm the bug is fixed with the latest trunk?
I talked to Naoki. The latest Mozilla trunk I was using automatically sent the message as plain text, because it contained no tags. That was most likely the reason why it worked for me. In order to reproduce the bug, in addition to pasting the long content, one must use at least one HTML style, like, make the first char bold or a larger font. I now see some problems, continuing to investigate.
Additional step for the test case in comment #22 (sorry I forgot to mention), "make the fist character as Bold to ensure the message is sent as text/html" also set charset to "ISO-2022-JP" makes the problem happen more often
While thinking about this bug, I was wondering, how would line breaking behave when there is a mixture of japanese text and long URLs in the body? I've filed bug 203122, not sure if it is valid.
I'm close to having a patch. I'd like to describe what the consequences of the patch are. The wrap column for HTML source is set to 72. We already wrap tags and latin words at whitespace bondaries. The line breaker code seems to correctly identify asian characters that can be used as a wrap position. That means, by using the line breaker we don't break URLs - at least we don't break URLs that are separated by whitespace from asian characters. However, we will introduce a new behaviour that will make the wrapped text look ugly, I think. Naoki, I wonder if this behaviour is ok. By inserting line breaks in the middle of japanese text, we insert significant whitespace. Imagine a document with 200 japanese characters, first char bold. The wrap column is 72. When counting the length of the line, the wrap code does not distinguish between tags and text. That means, the more text tags there are on a line of japanese text, the less text characters will stay on the line after wrapping. <b>x</b>xx-199-times-xx This will be broken into: <b>x</b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx However, because the bold tags are not visual, the japanese text will look like this: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx That means, it will appear to the user that we are wrapping at random positions in the document. If there are HTML tags in the document, the right border can be different on each line. But because we are inserting significant whitespace, there are more effects when the user continues to edit the document after saving and opening. Imagine the user inserts 10 more japanese characters in the first line (the line with the bold tag). While editing, we do not yet break, and the document will look like that: xxxxxxxxxxxxxNNNNNNNNNNxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx The first line is now longer than the wrap column, the other lines are still short enough. When we save, the code will wrap the document again. The results are: xxxxxxxxxxxxxNNNNNNNNNNxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Question: Is the above behaviour, that will be a side effect of the patch, acceptable? If you don't like the behaviour, it would be tricky to avoid it. I see two things we'd have to do: - "unwrap" when editing. That is, when loading a document, remove every single line break between two japanese characters, if there is nothing else between the characters than the single line break. Unfortunately, that would also erase any line breaks the user has typed explicitly. - no not count HTML tag characters when searching for the wrap position Finally my questions is, what do we want? 1) Land the patch and accept that we will see lines with Japanese text of different lengths. 2) Do 1), and file a new bug on the "unwrap" and "don't count HTML tags to implement wrapping" suggestions. This would be done at a later time in another step. 3) All problems must be solved at once, delay checking in the wrapping fix, until we have a solution for all the problems described.
Kai, I have a question. Does the inconsistent line length by HTML tags affect actual display when sent as text/html or the lines are not aligned only in message source? >But because we are inserting significant whitespace, there are more effects when >the user continues to edit the document after saving and opening. do you mean whitespace as CRLF?
> Does the inconsistent line length by HTML tags affect actual display when sent > as text/html or the lines are not aligned only in message source? It will affect the actual display, because the wrap logic counts any chars in the source. >> But because we are inserting significant whitespace, there are more effects >> when the user continues to edit the document after saving and opening. > do you mean whitespace as CRLF? Yes, both CRLF and SPACE in HTML source have the same visual effect when the document is actually displayed. If we insert CRLF to break the line, this will result in a space to appear when the document is displayed.
To clarify my long explanation, I was unclear when talking about what's displayed when the document is displayed. I said "because the bold tags are not visual, the japanese text will look like this:" xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx You might see spaces instead of linebreaks in the actual display. I also said, after doing two save/edit cycles with the example document, the document will look like that: xxxxxxxxxxxxxNNNNNNNNNNxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Again, instead of line breaks, you could see a space between those blocks of japanese characters.
Kaie, you're saying that a line break between two characters in the html source always displays as a line break in the displayed html when displaying Japanese? That would be odd; it's certainly not true in English/Latin text (though it does usually show up as whitespace). Why is that different in Japanese?
> Kaie, you're saying that a line break between two characters in the html source > always displays as a line break in the displayed html when displaying Japanese? > That would be odd; it's certainly not true in English/Latin text (though it > does usually show up as whitespace). Why is that different in Japanese? I was wrong when I said "will be shown as line breaks". What I want to say is: "The line breaks we introcuce will show up as spaces". Although the japanese user did not enter any spaces, we will show spaces in the document.
So actual space character is added for line break or CRLF added to the source is shown as a space when displayed?
Oops. I have to apologize, sorry for the confusion. The effects I saw, the shown spaces, were caused by hard spaces the preliminary version of my line wrap patch is inserting into the document. :-( I did not see that and made the incorrect conclusion about the behaviour of the layout engine. Thanks for your comments, you helped me to understand it better. Actually, the layout engine already seems to do what I suggested. When there is a CRLF in the source between two japanese characters, nothing is rendered in the display. So all is well. I will come up with a patch shortly.
>Actually, the layout engine already seems to do what I suggested. I think the change (not to show CRLF as a space for Japanese text) is done by after NS7.0. The current trunk does not show spaces. So we might need some extra work to support 7.0 or earlier versions.
Attached patch Patch v2 (obsolete) (deleted) — Splinter Review
I believe this patch is correct. This is a rewrite of the AppendStringWrapped method. I began to tweak the original method but the existing code looked to confusing to me, so I decided I'm better off by rewriting it. There are two basic ideas used in the patch - as it has been suggested, when searching for a wrap position, use the line breaker helper class, and only fall back to a simpler approach if it doesn't help - never translate a sequence of end-of-line characters (with no spaces or tabs) into a space. This protects us from visible space showing up. Inserting linebreaks is "better" than space, because the layout engine obviously is smart enough to not show linebreaks as whitespace between asian characters. I also added a lot of comments to the new code, in the hope the new code will be more easily maintainable. I have also added the tab character as a valid whitespace character. The existing code only checked for ' ' and '\n'. Checking for '\t' in addition is correct, right?
Attachment #119697 - Attachment is obsolete: true
I tested with simple and long documents, long and short latin1 words and URLS, and mixed documents, containing both asian and latin characters. I also tested a large web document from http://www.spiegel.de One thing I noticed is a slightly new behaviour, with the rewrite, I think it's now better than before: When the original document has a sequence of <!-- x --> <!-- x2 --> the existing serializer transforms that into <!-- x --><!-- x2 --> The new code no longer does that.
Any help in reviewing this patch mostly appreciated!
Comment on attachment 121713 [details] [diff] [review] Patch v2 Akkana offered to have a look at the patch. Thanks a lot. Burpmaster, do you have time to look at the code, too? Akkana said, you have a good knowledge of the wrapping code.
Attachment #121713 - Flags: review?(akkana)
ftang can you take a look too?
I reread my patch, I want to make the following minor changes to the patch: - mColPos += pos - sequenceStart; is adding too much, mColPos is already at mMaxColumn, fix this by removing that line and adding a ++mColPos next to ++pos in the loop. - all do{switch{}}while(); loops can optimized by moving the ++ statements into the default: part of the switch block.
Attachment #121713 - Attachment is obsolete: true
Attachment #121713 - Flags: review?(akkana)
Attached patch Patch v2b (obsolete) (deleted) — Splinter Review
Slightly updated patch, as described in previous comment.
Attachment #121765 - Flags: review?(akkana)
The displaying space problem I mentioned in comment #38, it happens with 7.0. If I reply quote the HTML as plain text then actual spaces are inserted for the original CRLF in the HTML source. The current trunk does not have the problem. I think we need to do something about it to support NS 7.0 users.
I don't have a good idea how to make our messages appear without spaces in the display in Mozilla 1.0 based products. Instead of wrapping with a line break char in the HTML source, we could also write a <br> tag. This would cause Mozilla 1.0 to do a hard break whereever it currently displays a space. This would make the spaces in the display go away, but the lines will no longer be aligned at the right side. But I think we don't want to insert <br>s into the source. By doing so, we would change the content.
>But I think we don't want to insert <br>s into the source. Right, CRLF is needed in order to avoid smtp server to force line break (that is 1000 bytes according to RFC 2821 4.5.3.1). I do not have a solution for the problem either. A possible workaround is to make line break to occur less frequently. E.g. if the break point is not a space or CRLF then search forward for another break point until hit a certain limit (e.g. 256 character).
> A possible workaround is to make line break to occur less frequently. > E.g. if the break point is not a space or CRLF then search forward for another > break point until hit a certain limit (e.g. 256 character). I fear we can't do that. A japanese characters is stored as a byte sequence of multiple chars in HTML, the char in the attached testcase is stored as &#12354; With the current default of wrapping at 72 characters, if a line contains only japanese characters, we already have lines that are 576 bytes long! If we want to stay below the max allowed length for mail messages, we could at most go to 123 allowed characters per line. I personally think, going from 72 to 123 does not improve our situation a lot. And isn't our situation already better? If I understand correctly, one japanese characters take up the same space as two latin characters. We still wrap at 72 characters, which is double the visual width as with latin text. That is, while we wrap at the width of 72 latin characters for latin text, we wrap at the width of 144 latin characters for japanese text. I'd prefer if we accept that users using Mozilla 1.0 based software will see some spaces, that we describe the behaviour in Mozilla 1.0 is a bug, and recommend users to upgrade.
Note, the 1.2 branch behaves good, and does not show the spaces!
>A japanese characters is stored as a byte sequence of multiple chars in HTML, >the char in the attached testcase is stored as &#12354; That happens when the charset cannot encode the Unicode code point. For the attached HTML, the charset is UTF-8 and a character is encoded as three bytes as 0xE38182 in that test case. So usually one character is encoded less than 8 bytes as &#12354; but it is true that such case can happen. But I see your point about space seen per 144 column. If that is acceptable or not that is not easy to tell. Without the fix we see data corruption when the line exceeds 1000 bytes, with the fix 7.0 users see spaces every 72 characters.
The other fact is that HTML composed mail with text only is sent as text/plain and the problem does not happen. So, I think the new problem happens more often than the data corruption but still does not happen so frequently. And as Kai mentioned the user has an option to upgrade.
I propose we land this patch. The patch is an improvement, and we don't have a perfect idea currently, only ideas how to slightly raise the limit how soon the new problem will be seen. The new problem is only of cosmetic nature (spaces). All data will be readable by the recipient. I think we should not care too much about Mozilla 1.0 in this case, because Mozilla 1.0 is broken anyway, when typing long lines. We can't fix that in 1.0. If we have a better idea at a later time, we can still implement it at a later time. Naoki, do you agree to take this patch as a first step to improve the situation?
Kai, I agree with you going forward with the current patch. We need to fix the data corruption. But I have to mention that the cosmetic problem (spaces) is not a minor issue.
Attachment #121765 - Flags: review?(akkana) → review?(jst)
I agree too. We need to fix the bug. It'd be nice if we had a solution that would not reveal the Moz 1.0 bug in msgs sent from Moz 1.4, but we don't. It's also not clear how often we will see an uninterrupted sequence of more than 72 kanji/kana in HTML mail messages. > I personally think, going from 72 to 123 does not improve our situation a lot. I don't know how much it improves, but it would be an improvement. As you note the column width is double for these characters, so it would be changing if from 144 to 246 columns. So why not take it, even if it is a small improvement?
> I don't know how much it improves, but it would be an improvement. As you > note the column width is double for these characters, so it would be changing > if from 144 to 246 columns. So why not take it, even if it is a small > improvement? The point is, there is not a single place in the source that we could change from 72 to 123, but there are multiple places in the code that specify 72 or other numbers. The API to the content serializer takes the requested width as an argument, and various code all over Mozilla makes use of that parameter and passes in the desired width. The value 72 only happens to be the default in the case we are looking at. If we say, we want 123 as the desired number, this would effectively mean, we remove the flexibility from the API, and have to hardcode the number into the serializer - because we don't want smaller numbers, and we can't allow larger numbers. I don't know the existing good code good enough to judge, whether we would introduce regressions by removing the parameter from the API and fixing the value to 123.
My conclusion from my previous comment: I think the few % likelyhood reduction we is not worth the trouble. Hardcoding the value is not an option, I think. IMHO, the only reasonable approach to increase the number is: Investigate and find all callers who call the function and pass in the wrap column as a parameter, and change the calling code to possibly use larger numbers.
Kaie is right that hardcoding is not a good option. In mozilla mail, the wrap column should be following the preference "mailnews.wraplength". Any "72" that's compiled into the source is be purely a fallback for embedding apps which don't control wrap length; and any embedding app that cares about such things should probably have a central place to control it (e.g. a pref like mozilla uses, or a single point from which the serializer is called after checking things lke wrap column and html-vs-plaintext).
Comment on attachment 121765 [details] [diff] [review] Patch v2b + nsresult rv; + nsCOMPtr<nsILineBreakerFactory> lf(do_GetService(kLWBrkCID, &rv)); + if (NS_SUCCEEDED(rv)) { + nsAutoString lbarg; + rv = lf->GetBreaker(lbarg, getter_AddRefs(mLineBreaker)); + if (NS_FAILED(rv)) return NS_ERROR_FAILURE; + } If a document is reachable here, you should ask it for the line breaker, if there is no document, or it has no line breaker, then get the service and ask it. And in stead of creating an empty nsAutoString object to pass to GetBreaker(), simply pass nsString() in stead, less code, and faster code. - In nsHTMLContentSerializer::AppendToStringWrapped(): The new logic here looks good to me, but it's non-trivial, so there's really no way to know for sure that this works w/o doing a lot of testing on a lot of data. As for this method, it suffers from a very common problem we see all over Mozilla code, it's way too long, and that makes it really hard to read. Could it be split up into two or more methods? Say one that handles a whole sequence of whitespace, and one that handles non-whitespace. That way you have one simple loop that calls either of the two above functions, depending on if we're dealing with whitespace or not. I have a few style-nits with the code, see below: + do { ... + } + while (...); Personally, I don't like that style at all, I'd want the while to be on the same line as the closing brace, i.e.: + do { ... + } while (...); That way it's obvious that the while belongs to the above do. There's a few of those in this code. + do { + switch (*pos) { + case ' ': + case '\t': + case '\n': + foundWhitespaceInLoop = PR_TRUE; + break; + default: + ++pos; + ++mColPos; + break; + } + } + while (!foundWhitespaceInLoop && + pos < end && + mColPos < mMaxColumn); Same here, and make the last two lines of the while condition line up with the opening paren. And if you'd make the switch in that loop an if, you could simply break out of the loop when you find some whitespace and avoid checking a boolean, in addition to the other checks, every iteration. + result = mLineBreaker->Prev(sequenceStart, + (end - sequenceStart), + (pos - sequenceStart)+1, + &wrapPosition, + &oNeedMoreText); + if (NS_SUCCEEDED(result) && !oNeedMoreText) { + foundWrapPosition = PR_TRUE; + } + else { + result = mLineBreaker->Next(sequenceStart, + (end - sequenceStart), + (pos - sequenceStart), + &wrapPosition, + &oNeedMoreText); + if (NS_SUCCEEDED(result) && !oNeedMoreText) { + foundWrapPosition = PR_TRUE; + } + } What if oNeedMoreText? Shouldn't happen, right? Maybe assert that it doesn't? + do { + switch (*pos) { + case ' ': + case '\t': + case '\n': + foundWhitespaceInLoop = PR_TRUE; + break; + default: + ++pos; + ++mColPos; + break; + } + } + while (!foundWhitespaceInLoop && pos < end); Same thing here, move the while, and you could avoid the need to check !foundWhitespaceInLoop if you change the switch to an if and just break out of the loop when you hit whitespace. Other than that, this looks really good.
Attachment #121765 - Flags: review?(jst) → review-
Comment on attachment 121765 [details] [diff] [review] Patch v2b I have a comment of my own. >+ if (pos == end) { >+ return; >+ } This is no longer needed because the while loop won't be entered in this case. >+ while (pos < end) {
Attached patch Patch v3 (obsolete) (deleted) — Splinter Review
> If a document is reachable here, you should ask it for the line breaker, if > there is no document, or it has no line breaker, then get the service and ask > it. The only way I found to reach a document was from nsIDOMText to nsIContent nsIDocument. Changed. > And in stead of creating an empty nsAutoString object to pass to > GetBreaker(), simply pass nsString() in stead, less code, and faster code. done > The new logic here looks good to me, but it's non-trivial, so there's really no > way to know for sure that this works w/o doing a lot of testing on a lot of > data. true > As for this method, it suffers from a very common problem we see all over > Mozilla code, it's way too long, and that makes it really hard to read. Could > it be split up into two or more methods? Say one that handles a whole sequence > of whitespace, and one that handles non-whitespace. done, split into two methods > + } > + while (...); > change to > + } while (...); done > + while (!foundWhitespaceInLoop && > + pos < end && > + mColPos < mMaxColumn); > And if you'd make the switch in that loop an if, you could > simply break out of the loop when you find some whitespace and avoid checking a > boolean, in addition to the other checks, every iteration. done, both places > What if oNeedMoreText? Shouldn't happen, right? Maybe assert that it doesn't? In my tests, it did happen!. But no need for an assertion. If we don't set foundWrapPosition, we will fall back to the trivial break logic. > + if (pos == end) { > + return; > + } > This is no longer needed because the while loop won't be entered in this case. Removed
Attachment #121765 - Attachment is obsolete: true
Attachment #122290 - Flags: review?(jst)
Comment on attachment 122290 [details] [diff] [review] Patch v3 + nsCOMPtr<nsIContent> content = do_QueryInterface(aText); + if (content) { + nsCOMPtr<nsIDocument> document; + content->GetDocument(*getter_AddRefs(document)); + if (document) { + document->GetLineBreaker(getter_AddRefs(mLineBreaker)); + } + } Using GetOwnerDocument() on aText would be a bit more reliable, i.e.: + nsCOMPtr<nsIDOMDocument> domDoc + aText->GetOwnerDocument(getter_AddRefs(domDoc)); + nsCOMPtr<nsIDocument> document = do_QueryInterface(domDoc); + if (document) { + document->GetLineBreaker(getter_AddRefs(mLineBreaker)); + } Further down: + do { ... + } while (pos < end && + mColPos < mMaxColumn); Isn't the latter condition more common? If so, make it the first condition... + nsresult result; Nit: Wanna use "rv" here (and other places too where you change related code) in stead of "result" (like we do in most places) to make lines a bit shorter? r=jst
Attachment #122290 - Flags: review?(jst) → review+
Attached patch Patch v4 (obsolete) (deleted) — Splinter Review
Patch having the requested changes.
Attachment #122290 - Attachment is obsolete: true
Comment on attachment 122481 [details] [diff] [review] Patch v4 carrying forward r=jst Peter, could you please sr= ?
Attachment #122481 - Flags: superreview?(peterv)
Attachment #122481 - Flags: review+
Comment on attachment 122481 [details] [diff] [review] Patch v4 Forgot to mention one nit: + do { + PRUnichar c = *pos; + if (' ' == c || '\t' == c || '\n' == c) { + foundWhitespaceInLoop = PR_TRUE; + break; + } + else { + ++pos; + ++mColPos; + } + } while (mColPos < mMaxColumn && + pos < end); No need for else after break there, i.e.: + do { + PRUnichar c = *pos; + if (' ' == c || '\t' == c || '\n' == c) { + foundWhitespaceInLoop = PR_TRUE; + break; + } + + ++pos; + ++mColPos; + } while (mColPos < mMaxColumn && + pos < end); And same thing here: + do { + PRUnichar c = *pos; + if (' ' == c || '\t' == c || '\n' == c) { + break; + } + else { + ++pos; + ++mColPos; + } + } while (pos < end); My review still applies.
My newest local patch file has the requested nit corrected.
Kai, wil this be targetted to land on the 1.5 train? Can you set the appropriate target milestone?
Target Milestone: --- → mozilla1.4beta
Blocks: PhtN7
Comment on attachment 122481 [details] [diff] [review] Patch v4 >Index: mozilla/content/base/src/nsHTMLContentSerializer.cpp >=================================================================== >@@ -82,8 +87,10 @@ > mColPos = 0; > mIndent = 0; > mAddSpace = PR_FALSE; >+ mMayIgnoreLineBreakSequence = PR_FALSE; > mInBody = PR_FALSE; > mInCDATA = PR_FALSE; >+ mNeedLineBreaker = PR_TRUE; Please make these member initializers. >@@ -188,6 +217,244 @@ > return NS_OK; > } > >+void nsHTMLContentSerializer::AppendWrapped_WhitespaceSequence( >+ nsASingleFragmentString::const_char_iterator &pos, >+ const nsASingleFragmentString::const_char_iterator end, >+ const nsASingleFragmentString::const_char_iterator sequenceStart, >+ PRBool &mayIgnoreStartOfLineWhitespaceSequence, >+ nsAString& aOutputStr) aPos, aEnd, ... Please be consistent, move the &'s next to the type. >+{ >+ // Handle the complete sequence of whitespace. >+ // Continue to iterate until we find the first non-whitespace char. >+ // Updates "pos" to point to the first unhandled char. >+ // Also updates the mayIgnoreStartOfLineWhitespaceSequence flag, >+ // as well as the other "global" state flags. >+ >+ PRBool sawBlankOrTab = PR_FALSE; >+ PRBool leaveLoop = PR_FALSE; >+ >+ do { >+ switch (*pos) { >+ case ' ': >+ case '\t': >+ sawBlankOrTab = PR_TRUE; >+ // no break >+ case '\n': >+ ++pos; >+ // do not increase mColPos, Trailing whitespace. >+ // because we will reduce the whitespace to a single char >+ break; >+ default: >+ leaveLoop = PR_TRUE; >+ break; >+ } >+ } while (!leaveLoop && pos < end); >+ >+ if (mAddSpace) { >+ // if we had previously been asked to add space, our situation has not changed Long line. >+ } >+ else if (!sawBlankOrTab && mMayIgnoreLineBreakSequence) { >+ // nothing to do >+ mMayIgnoreLineBreakSequence = PR_FALSE; >+ } >+ else if (mayIgnoreStartOfLineWhitespaceSequence) { Trailing whitespace. >+ // nothing to do >+ mayIgnoreStartOfLineWhitespaceSequence = PR_FALSE; >+ } >+ else { >+ if (sawBlankOrTab) { >+ if (mColPos+1 >= mMaxColumn) { Spaces around +. >+ // no much sense in delaying, we only have one slot left, Trailing whitespace. >+void nsHTMLContentSerializer::AppendWrapped_NonWhitespaceSequence( >+ nsASingleFragmentString::const_char_iterator &pos, >+ const nsASingleFragmentString::const_char_iterator end, >+ const nsASingleFragmentString::const_char_iterator sequenceStart, >+ PRBool &mayIgnoreStartOfLineWhitespaceSequence, >+ nsAString& aOutputStr) >+ do { >+ onceAgainBecauseWeAddedBreakInFront = PR_FALSE; >+ foundWhitespaceInLoop = PR_FALSE; >+ >+ do { >+ PRUnichar c = *pos; >+ if (' ' == c || '\t' == c || '\n' == c) { Nit: the old way (if (*pos == ' ' || ...)) looks nicer to me. >+ foundWhitespaceInLoop = PR_TRUE; >+ break; >+ } >+ else { >+ ++pos; >+ ++mColPos; >+ } >+ } while (mColPos < mMaxColumn && >+ pos < end); No need to put the conditions on two lines. >+ >+ if (pos == end || foundWhitespaceInLoop) { >+ // there is enough room for the complete block we found >+ >+ if (mAddSpace) { >+ aOutputStr.Append(PRUnichar(' ')); >+ mAddSpace = PR_FALSE; >+ } >+ >+ aOutputStr.Append(sequenceStart, pos - sequenceStart); >+ // We have not yet reached the max column, we will continue to >+ // fill the current line in the next outer loop iteration. >+ } >+ else { // mColPos == mMaxColumn >+ Remove this new line. >+ if (!thisSequenceStartsAtBeginningOfLine && mAddSpace) { >+ // We can avoid to wrap. >+ >+ aOutputStr.Append(mLineBreak); >+ mAddSpace = PR_FALSE; >+ pos = sequenceStart; >+ mColPos = 0; >+ thisSequenceStartsAtBeginningOfLine = PR_TRUE; >+ onceAgainBecauseWeAddedBreakInFront = PR_TRUE; >+ } >+ else >+ { Move this accolade up one line. >+ // we must wrap >+ >+ PRBool foundWrapPosition = PR_FALSE; >+ >+ if (mLineBreaker) { // we have a line breaker helper object >+ PRUint32 wrapPosition; >+ PRBool oNeedMoreText; needMoreText? >+ nsresult rv; >+ >+ rv = mLineBreaker->Prev(sequenceStart, >+ (end - sequenceStart), >+ (pos - sequenceStart)+1, Spaces around the +. >+ if (!mLineBreaker || !foundWrapPosition) { >+ // try some simple fallback logic >+ // go forward up to the next whitespace position, >+ // in the worst case this will be all the rest of the data >+ >+ do { >+ PRUnichar c = *pos; >+ if (' ' == c || '\t' == c || '\n' == c) { See nit above. >+ break; >+ } >+ else { >+ ++pos; >+ ++mColPos; >+ } >+ } while (pos < end); >+ >+ if (mAddSpace) { >+ aOutputStr.Append(PRUnichar(' ')); >+ mAddSpace = PR_FALSE; >+ } >+ >+ aOutputStr.Append(sequenceStart, pos - sequenceStart); >+ } >+ } >+ } >+ } >+ while (onceAgainBecauseWeAddedBreakInFront); Be consistent, move the while behind the accolade. >+} >+ >+void >+nsHTMLContentSerializer::AppendToStringWrapped(const nsASingleFragmentString& aStr, >+ nsAString& aOutputStr, >+ PRBool aTranslateEntities) >+{ >+ // if beginning of a whitespace sequence >+ if (' ' == c || '\n' == c || '\t' == c) { See nit above. >Index: mozilla/content/base/src/nsHTMLContentSerializer.h >=================================================================== >@@ -139,6 +153,9 @@ > PRInt32 mMaxColumn; > > nsString mLineBreak; >+ >+ PRBool mNeedLineBreaker; You could make this a PRPackedBool and move it next to mInCDATA.
Attached patch Patch v5 (obsolete) (deleted) — Splinter Review
Changed everything, except the ' ' == *aPos notation. I prefer it, because it protects me from doing an accidential assignment.
Attachment #122481 - Attachment is obsolete: true
Attachment #122481 - Flags: superreview?(peterv)
Attachment #123398 - Flags: superreview?(peterv)
Attachment #123398 - Flags: review+
> I prefer it, because it protects me from doing an accidential assignment. That's what testing and reviewers are for. content code (and its owners/peers) generally prefers the variable on the left hand side of a comparison. And more to the point, we hate seeing style clashes in our files. People (such as jst) run through the content/ code randomly to clean this sort of stuff up...
> That's what testing and reviewers are for. Why didn't the review process then find the incorrect code in my patch v2? + if (foundWrapPosition && 0 == wrapPosition) { + foundWrapPosition == PR_FALSE; + } In the end it's the coder who is responsible for code correctness. The compiler found that for us. > content code (and its owners/peers) > generally prefers the variable on the left hand side of a comparison. > And more to the point, we hate seeing style clashes in our files. I don't get the point how safe code can be assumed as being incorrect style. > People (such as jst) > run through the content/ code randomly to clean this sort of stuff up... I already have r=jst ;-)
Comment on attachment 123398 [details] [diff] [review] Patch v5 >Index: mozilla/content/base/src/nsHTMLContentSerializer.cpp >=================================================================== >@@ -78,12 +83,14 @@ > } > > nsHTMLContentSerializer::nsHTMLContentSerializer() >+:mIndent(0) >+,mColPos(0) >+,mInBody(PR_FALSE) >+,mAddSpace(PR_FALSE) >+,mMayIgnoreLineBreakSequence(PR_FALSE) >+,mInCDATA(PR_FALSE) >+,mNeedLineBreaker(PR_TRUE) Operators need to be add end of line and add some spaces: : mIndent(0), mColPos(0), etc, >+ do { >+ PRUnichar c = *aPos; >+ if (' ' == c || '\t' == c || '\n' == c) { I still don't understand why you use that local var though (here and further below).
Attachment #123398 - Flags: superreview?(peterv) → superreview+
Attached patch Patch v6 (obsolete) (deleted) — Splinter Review
Attachment #123398 - Attachment is obsolete: true
Attachment #123500 - Flags: superreview+
Attachment #123500 - Flags: review+
Attachment #123500 - Flags: approval1.4?
Attachment #123500 - Flags: superreview+
Attachment #123500 - Flags: review+
Attachment #123500 - Flags: approval1.4?
Attached patch Patch v7 (deleted) — Splinter Review
Addressed all comments.
Attachment #123500 - Attachment is obsolete: true
Attachment #123504 - Flags: superreview+
Attachment #123504 - Flags: review+
Attachment #123504 - Flags: approval1.4?
> I don't get the point how safe code can be assumed as being incorrect style. It can be hard to read. Furthermore, gcc warns about unparenthesized assignment in if, while, etc., conditionals.
peterv, re: comment 72: I use local vars to "common" dereferenced pointers, even if a decent compiler can do it for me, just for brevity's sake, and in case memory ambiguity (possibility of aliasing, or side effects in functions called) would make for a needless reload (not the case here, but the pattern's the thing). /be
Comment on attachment 123504 [details] [diff] [review] Patch v7 This looks like the right thing to do so we're approving it for 1.4. But it also looks large so please keep a close eye on nightly build and 1.4 RC1 feedback for any regressions.
Attachment #123504 - Flags: approval1.4? → approval1.4+
Brendan: I stand corrected, I'll keep that nit to myself in the future :-). I find it confusing because the value of the dereferenced pointer isn't changing but with the local var there's two values to keep track of.
adt: nsbeta1-
Keywords: nsbeta1nsbeta1-
Did this land yet? I don't see anything in bonsai.
Whiteboard: edt_x3 → edt_x3 landed1.4?
I'll check it in today.
fixed on trunk
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Whiteboard: edt_x3 landed1.4? → edt_x3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: