Closed Bug 197166 Opened 22 years ago Closed 21 years ago

[nsILineInputStream] shouldn't be allocating/using AString buffers

Categories

(Core Graveyard :: File Handling, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dwitte, Assigned: Bienvenu)

References

Details

(Keywords: fixed-aviary1.0, fixed1.7.5)

Attachments

(3 files, 3 obsolete files)

darin mentioned this a while back, when we were reviewing one of the cookie patches. nsILineInputStream::ReadLine gives back an nsAString, which is silly, because it's just reading an ACString from the file and zero-padding it to give a unicode string. So the callers are forced to use a temp AString buffer, and downconvert back to ASCII in most cases. nsILineInputStream only has a few real consumers, so should be fairly simple to fix it and update its callers.
Which way are we going to jump here? Option 1 is to make it work with ASCII only. Option 2 is to make it do conversions to Unicode from arbitrary charsets instead of what it does now. At the time when I wrote it, it didn't really matter for the use I had for it, and I did not have a very good grasp of intl issues. If I were doing it today, I think I'd be tempted by option 2....
>Which way are we going to jump here? Option 1 is to make it work with ASCII >only. Option 2 is to make it do conversions to Unicode from arbitrary charsets >instead of what it does now. When darin mentioned it, I understood he thought option 1 was fine. >At the time when I wrote it, it didn't really matter for the use I had for it, >and I did not have a very good grasp of intl issues. If I were doing it today, >I think I'd be tempted by option 2.... Of course, I have little grasp of intl issues at this point ;) I would presume all file formats across plaftorms are ASCII, in which case option 1 avoids the temp buffer copies that lineinputstream does now. if this is wrong, well...
> I would presume all file formats across plaftorms are ASCII That's the pretty questionable assumption.... If we make this assumption, we may be OK with ASCII or UTF8 or Shift_JIS files, but God forbid someone uses UTF-16.... What are the current users of this code other than mailcap stuff?
Hmm, ok, I'd better wait for someone to decide which option to take then. Currently the only other user apart from mailcap, is uriloader/exthandler/os2/nsOSHelperAppService.cpp, and soon-to-be cookie code...
Oh, one final note. the impl for nsILineInputStream currently uses nsFileInputStream::Read, which returns a char* buffer. So that would also have to be fixed to deal with arbitrary charsets? Which is a much bigger undertaking, methinks...
Well.. that char* buffer is a buffer of _bytes_, which is all nsIInputStream promises. But the basic premise of a "line" input stream is that you are providiing characters, since the concept of a "line" makes no sense if you're working with bytes. So the line input stream has to convert bytes to characters somehow; the question is exactly how. ;)
nsBookmarksService.cpp also uses this code... see my patch in bug 191783, which "decided" that nsILineInputStream was doing UTF8 conversion.
erg, thanks for the link bsmedberg. I didn't realize boris already had a "plan" for it ;) so i guess the AString outparam is kinda necessary if we want to support arbitrary charsets. So it comes back to the question of 1) putting in a quick fix that changes it to deal only with ACStrings, or 2) go and extend the iface to allow the consumer to specify the charset. If we go with 2), it'd be nice to have two ::ReadLine functions; one that takes AString and another ACString, to avoid silly up/downconverting in the latter case. bsmedberg, bz, darin, alecf: what's the vote for now? if #2, who's willing to take it on?
a) All our current consumers assume ASCII/UTF8 input. So why don't we codify that? (which is what my patch in bug 191783 does). b) If we add a CString readLine method, can it be AUTF8String? i.e. boolean readUTF8Line(out AUTF8String aLine); I don't think that this would hurt any consumers that are assuming ASCII.
Option a) sounds good to me - let's fix it up for ASCII & UTF8 only, since that's all the consumers care about at the moment. And if we go that route, we can change the iface to AUTF8String, kill the conversions, and be done with it. Adding a second AString iface can be a future exercise, if anyone actually wants it. comments?
OS: Windows 2000 → All
Yeah, I'm ok with just treating things as UTF-8 and returning nsAUTF8String for now... But document that _very_ clearly. ;)
i think this kind of low-level interface does not need to restrict the charset. it does not need to know anything about charsets. it can claim to simply read a hunk of data up to but not including the first \r, \n, or \r\n. it should be just like fgets or getline functions found in standard C/C++. as soon as you require a charset, like UTF-8, then you suddenly have to discover what the charset of the underlying stream is. there isn't any good way to do that. and once you do find out the charset, you will have to convert it from whatever charset it is to UTF-8. most likely it will not be UTF-8, since we are mostly talking about streams originating from the filesystem, and filesystems mostly don't use UTF-8 (today... tomorrow may be different of course). so, now we will have to add generic code to nsLineInputStream that converts foo-charset to UTF-8. well, i can tell you that that generally requires an intermediate conversion to UCS-2 or UTF-16. doing so is just wrong at this level. don't go there. keep it simple please!! [scriptable, uuid(... new IID required ...)] interface nsILineInputStream : nsISupports { boolean readLine(out ACString aLine); }; ACString can pass binary data. this is one of the key distinctions between ACString and the plain old |string| IDL type. Javascript/XPConnect will not have any trouble dealing with aLine containing binary crap provided the JS code is prepared to deal with such. XPConnect will simply zero-pad expand the data, which is a non-destructive conversion, and is equivalent to what we are currently providing with an AString out param. JS can do something fancy with the zero-pad expanded data if it needs to.
Yeah... The only problem there is that a UCS-2 file containing \r\n to denote EOL will look like two lines to the proposed readLine (since the bytes are 0x10 0x0 0x13 0x0). I think that's something we can live with.
the interface could support configurable line endings, but i think that might just be going off the deep end since we don't have any consumers requiring it 8^)
after this is fixed, we might want to start using it in bug 202474, where we're writing our own line-parsing code.
after this is fixed, we might want to start using it in bug 202474, where we're writing our own line-parsing code.
*** Bug 209431 has been marked as a duplicate of this bug. ***
I agree with darin... let's not make things too complicated here... Now if you insist, you could add a parameter "boolean two_byte_line_terminator" to readLine, which would look for \r\0\n\0.
after talking to Darin and Dan, taking - I need this anyway...
Assignee: dwitte → bienvenu
Attached patch work in progress (obsolete) (deleted) — Splinter Review
this seems to work. Most of it is pretty straightforward. I did have to fix nsFileStreams Seek method to clear the mLineBuffer, or else if you read some stuff, then seek, then read, you'll get some of the previous buffer. Some of the code I haven't compiled because its Linux or OS/2 or FireFox-specific. I need to clean up nsMsgBodyHandler::StripHtml, which as I have it now, manipulates the nsCString's buffer directly. this patch makes local folder searching on windows quite a bit faster. My debug build is 25% faster than my release build, so the speedup is likely more than 25%, all from caching the input stream and not using nsFileSpec's readLine.
So which of the options discussed in comment 1 through comment 13 did we go with, then?
Darin's comment 12: [scriptable, uuid(... new IID required ...)] interface nsILineInputStream : nsISupports { boolean readLine(out ACString aLine); };
OK. Could we please very clearly document in this interface exactly what it means by a "line" so that people will know, in that case? Because what it means is NOT "a line of text" but rather "a collection of bytes satisfying certain criteria". Those criteria need to be explicitly listed.
Is this accurate/sufficient? It looks like this is what NS_ReadLine does... * Read a single line from the stream, where a line is a * possibly zero length sequence of bytes terminated by * CR, LF, CRLF, LFCR, or eof. * The line terminator is not returned. * Return false for end of file, true otherwise Or should I just put it as Darin so succinctly did: read a hunk of bytes up to but not including the first \r, \n, or \r\n
David, that looks great. Thanks!
Oh, and as far as the unix and OS/2 changes go, they will need to be a lot more extensive... there are 4-5 callers of readLine() in each of those files... And we'll need to decide what to do with the bytes once we get then (converting to unicode from native is what I think we should do).
Attached patch non-mail diffs. (obsolete) (deleted) — Splinter Review
oops, thx, yes, I certainly missed a few there. This patch gets more of the readLine calls, all of them, I hope. Re how to convert, maybe I'm being naive, but the old code just did a 0 expansion of the 8 bit data, via Assign/AppendWithConversion, so if we keep doing that, we'll be working the same say the old code did. The callers can do what they want, of course. Many of them were pretty insistent on using 16 bit chars, even though the data is really 8 bit, so I didn't convert everything over to nsCStrings because the changes would ripple too much.
Attachment #146247 - Attachment is obsolete: true
I think you still missed at least two callers of readLine in nsOSHelperAppServiceUnix (and presumably the same on OS/2). Are you going to get a chance to build this on Unix before checking in, by the way?
it will be compiled on unix before being checked in, and I hope to get someone to compile it on OS/2 as well...I did miss one unix call, but I think I got all the OS/2 ones...I'll apply it against firefox as well.
I believe this should compile on linux, os/2, and with firefox...
Attachment #146289 - Attachment is obsolete: true
Comment on attachment 146304 [details] [diff] [review] fix readLine case I missed on linux, use of StringBeginsWith for linux and os/2 Darin, should anyone else look at this? Besides folks already cc'ed?
Attachment #146304 - Flags: superreview?(mscott)
Attachment #146304 - Flags: review?(darin)
nsFileInputStream::Seek(PRInt32 aWhence, PRInt64 aOffset) { + PR_FREEIF(mLineBuffer); // this invalidates the line buffer + mLineBuffer = nsnull; the setting to nsnull is not needed - PR_FREEIF will do that for you; see http://lxr.mozilla.org/seamonkey/source/nsprpub/pr/include/prmem.h#152
oh, of course - I'll fix that and the place I copied the code from.
Comment on attachment 146304 [details] [diff] [review] fix readLine case I missed on linux, use of StringBeginsWith for linux and os/2 Note that this patch changes the behavior of nsDogbertProfileMigrator.cpp (it's going to write out a lot more data now than it did before). The change is probably correct (the old code looks totally bogus to me), but Ben should review it. For the rest, I'm just reading the OS/2 and Unix helper app service changes. >Index: uriloader/exthandler/os2/nsOSHelperAppService.cpp >@@ -559,6 +559,7 @@ > descriptionStart, descriptionEnd; > > do { >+ buf.AssignWithConversion(cBuf); (note that using the -p option to cvs diff would make this a lot easier to review). Indent is off. Also, I would much prefer |CopyASCIItoUTF16(cBuf, buf);| so that it's clear that in fact absolutely no conversion is happening. Same thing at all other points where AssignWithConversion is used. With that, the OS/2 and Unix changes helper app service changes are fine by me.
Attachment #146304 - Flags: superreview?(mscott)
Attachment #146304 - Flags: review?(darin)
Attachment #146374 - Flags: superreview?(mscott)
Attachment #146374 - Flags: review?(darin)
Comment on attachment 146374 [details] [diff] [review] use CopyAsciiToUTF16, add a missing NS_LITERAL_CSTRING, remove unneeded null assignment can nsReadCLine.h simply be cvs removed now? (i.e., there doesn't seem to be any reason to delete its contents... outright removal makes more sense, no?) + if (!utf8Buffer.Equals("#2c")) { NS_ERROR("Unexpected version header in signon file"); maybe use NS_LITERAL_CSTRING here. r=darin
Attachment #146374 - Flags: review?(darin) → review+
yeah, cvs removal of nsReadCline.h was what I had in mind after getting this all in. And I'll add the NS_LITERAL_CSTRING
Attachment #146374 - Flags: superreview?(mscott) → superreview+
Attached patch mailnews part of fix (deleted) — Splinter Review
Attachment #146556 - Flags: superreview?(mscott)
Attachment #146556 - Flags: superreview?(mscott) → superreview+
couple more directories
fixed.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
This checkin broke MIME type lookups on Linux. See bug 242495
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: