mailstore conversion doesn't preserve character encodings
Categories
(Thunderbird :: General, defect)
Tracking
(Not tracked)
People
(Reporter: benc, Assigned: benc)
References
(Blocks 1 open bug)
Details
Attachments
(3 files, 5 obsolete files)
(deleted),
application/x-javascript
|
Details | |
(deleted),
patch
|
mkmelin
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
mkmelin
:
review+
|
Details | Diff | Splinter Review |
Updated•6 years ago
|
Updated•5 years ago
|
Comment 2•5 years ago
|
||
"opened 11 months ago". That's why people produce duplicates: https://bugzilla.mozilla.org/show_bug.cgi?id=1593455
Where is the problem to get rid of that bug? Converting mbox to maildir is just cutting the mbox text file into smaller pieces of text. Just do not touch the umlauts.
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 3•5 years ago
|
||
Notes to self:
The issue is that we're treating the mbox as a text file. It's not really - it's a mashup of text files, all with potentially different encodings. It needs to be treated as binary.
The complication is that the converter code uses regexp to implement the (rather hairy) rules for detecting the separator lines ("From ..."). We have to cope with all kinds of different formats, and also have to heuristics to cope with mbox files which don't properly escape lines in the message body which begin with "From " so we don't accidentally cut messages into bits.
So, we need to ditch the TextEncoder/TextDecoder and deal with the raw bytes directly. This could still involve converting it into a string to use the existing regexp, but that's a bit icky and doubles the memory requirements (stupid ucs-2 javascript strings and their 16bit chars). So if we can efficiently scan the raw Uint8Array chunks we're already reading in that'd be much nicer.
Longer term:
- We should probably aspire to using the mbox format standardised in RFC4155. But we're quite a way from that (for one, it specifies a 7bit-clean stream, so we'd have to parse headers and re-encode appropriately). ie Be forgiving about the mbox formats we can read in, but be really strict about the one we write out.
- The converter should use the same mbox code as the C++ messagestore. We don't at the moment, due to concerns about single threadedness and freezing the GUI which the conversion runs, but we should figure out how to resolve this and unify the code.
At the moment thunderbird accepts every line starting with "From " (regardless of anything that follows in that line) as an envelope line. The elaborate regexp used for the conversion does already miss some (admittedly very old) envelope styles present in some of my old mboxes. So indeed it might not have to be so complicated. If the conversion uses
let sepRE =/^(?:From .*\r?\n)[\x21-\x7E]+:/gm;
that would be (almost) equivalent to what nsParseMailMessageState::IsEnvelopeLine() in comm/mailnews/local/src/nsParseMailbox.cpp detects. Could then the whole decode/encode cycle be ditched?
Assignee | ||
Comment 5•5 years ago
|
||
Work-in-progress patch which should sort things out. Works for me, although I want to do some stress-testing and I need to do a pass through to clear out the debug output cruft and tidy the code up.
The approach I took is to treat all the incoming data as binary rather than (insanely) trying to decode it as text. It means regexps can't be used, and the alternative code is a little clunky, but that's what you get for doing byte-wrangling in Javascript :-)
(In reply to Cluster15 from comment #4)
let sepRE =/^(?:From .*\r?\n)[\x21-\x7E]+:/gm;
That's more or less what I ended up going for.
A "From " at the beginning of a line, ignoring the rest of the line, as long as the next line looks like it's a header field.
Assignee | ||
Comment 6•5 years ago
|
||
A tidied up/tweaked version of the patch.
Assignee | ||
Comment 7•5 years ago
|
||
Just some notes for future reference:
The original version of the mbox->maildir conversion used a utf-8 decoder upon the incoming mbox data to get it into a javascript string. It would then search for "From " separator lines, break up the string and pass it back through a UTF-8 decoder as it was written out to individual files in the maildir.
With a 7-bit-clean mbox file (which I think most mbox variants are expected to be, using base64 encoding or whatever when contents are 8bit) this does no harm. 7 bit ASCII maps directly to UTF-8 and back again with no loss. However, out in the real world, we've got mbox files which have message containing all sorts of non-7-bit encodings.
Especially when the obvious maildir->mbox conversion method is to concatenate files, adding "From " lines between each one.
It's not guaranteed that such 8-bit messages in an mbox would survive the UTF-8 round trip intact.
So, the safe thing is to treat the mbox files as binary, and preserve everything between "From " separators as is.
Comment 8•5 years ago
|
||
Updated•5 years ago
|
Assignee | ||
Comment 9•5 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #8)
I'm not sure I still understand why the earlier thing wasn't working
properly for all cases. Even if we're encoding to UTF-8 and it's not, we
only care about finding the From parts which are always ascii and thus utf-8
compatible.
We still need to get the parts in between the "From " separator lines as 8bit bytes, and those chunks won't survive the UTF-8 round trip (some byte sequences are invalid UTF-8, so will be corrupted).
Anyway, while it seems to "work", the performance is really abysmal.
Status quo:; 1G mbox -> maildir took 18s.
With the patch the same operation takes around 10 minutes.
Ouch! Yep, that's unacceptable. Will investigate and try some alternatives.
Assignee | ||
Comment 10•5 years ago
|
||
Lots of faffing about, but here's a different approach.
It's a simpler patch which ditches the utf-8 TextEncoder
/TextDecoder
pair (which was corrupting the data, because it's not utf-8). Instead, it uses a pair of functions which just map bytes directly to and from characters in a javascript string (so only characters 0-255 appear in the string form).
It's still a little sucky because we're using twice the memory we really need to.
But on the plus side:
- we can still use regexps to look for "From " separators (which makes for faster searching manually looking through Uint8Arrays)
- JS strings have all kinds of magic optimisations to speed things up.
This patch works on the previously-failing unit tests, and seems OK-but-not-great on my 150MB mbox example: 10 secs or so (I need to synthesize up some multi-GB mbox tests). But this approach should be more amenable to speeding up than my previous patch.
Geoff, a couple of JS questions:
- Is there a "proper" way to do some performance timers in JS? I want to wrap various bits sections of my code up to accumulate timings to see where the time is being spent. The MDN docs seem to suggest the JS performance timers are hobbled for security reasons...
- In the attached patch, I suspect
stringToBytes
andbytesToString
are the choke points. I've left them both pretty naive as I know JS optimisation isn't always intuitive. Can you see any obvious ways to speed them up? (I did trylet str = String.fromCharCode(...rawBytes)
, but of course hit some upper limit on param count. Next step I guess would be to try this in batches, but seems silly without code to measure it).
Comment 11•5 years ago
|
||
Before I look at this more closely, you do know that TextDecoder/TextEncoder doesn't have to be UTF-8, right?
Comment 12•5 years ago
|
||
The problem is, that berkeley-style mboxes can contain a mixture of valid 8-bit encoded mails (and of course any 7bit encoded as well). One encoding - regardless whichever you choose - will simply not fit in those cases. I have hundreds of 8-bit mails in mboxes where ASCII 7-bit encoded, 8-bit- encoded ISO-8859-1(5)? and later on utf-8 8-bit encoded mails are happily glued together. Unless such a case is captured by the
'x-user-defined' encoding mentioned (but not documented) near the end of the encoding list, any text encoding with just one encoding will fail whenever there is an encoding mismatch.
I might add that while performance maybe an issue, in a way I do think a couple of minutes (up to 20min) is easily tolerable. It's not something you will do on a daily basis. And I personally prefer correctness of code before performance. Tell the user how long it might take and that this is a one-time action and it will be fine.
Assignee | ||
Comment 13•5 years ago
|
||
(In reply to Geoff Lankow (:darktrojan) from comment #11)
Before I look at this more closely, you do know that TextDecoder/TextEncoder doesn't have to be UTF-8, right?
Yes, but I didn't see anything which I was confident would map bytes unmolested to UTF-16 0-255. I'd love to be proved wrong here ;-)
Comment 14•5 years ago
|
||
(In reply to Ben Campbell from comment #10)
Geoff, a couple of JS questions:
- Is there a "proper" way to do some performance timers in JS? I want to wrap various bits sections of my code up to accumulate timings to see where the time is being spent. The MDN docs seem to suggest the JS performance timers are hobbled for security reasons...
The imprecision might not be there for chrome code but I'm not sure. It might pay to ask somewhere like mozilla.dev.platform.
- In the attached patch, I suspect
stringToBytes
andbytesToString
are the choke points. I've left them both pretty naive as I know JS optimisation isn't always intuitive. Can you see any obvious ways to speed them up? (I did trylet str = String.fromCharCode(...rawBytes)
, but of course hit some upper limit on param count. Next step I guess would be to try this in batches, but seems silly without code to measure it).
I suspect you'll get better performance going through the Blob class, but it's only a suspicion:
// string to array buffer
let buffer = await new Blob([string]).arrayBuffer();
let array = new Uint8Array(buffer);
// back again
let reader = new FileReader();
reader.addEventListener("loadend", () => {
string = reader.result;
});
reader.readAsBinaryString(new Blob([array]));
Assignee | ||
Comment 15•5 years ago
|
||
I've been doing some timings with my patch, and it's about half the speed of the original. Which isn't toooo bad considering it's doing the binary->text decoding in naive loop-over-every-byte javascript rather than native C++ (or Rust - I forget which).
I'm sure we can shave off some time by being a little more clever (eg Geoff's Blob suggestion, although we're in a worker thread, so there might be complications there...).
On my machine, it converts a faked-up 1.2GB mailbox with 1 million small text-only emails in about 5 minutes (vs 2:30 with the old code).
It seems to scale linearly and I didn't spot any weird runaway memory usage (can never quite tell what the garbage collection will do ;- ).
So the next step would be to try some simple-as-possble approaches to shave some time off, then call it done.
However.
While I was running my tests, I noticed that the file counts in the resulting maildir were way lower that they should be - looks like 30-40% loss. I happens both with and without my patch, so I think the problem was already there, we just never noticed it before. The minimal test cases use a small number of emails (like 20-30, not thousands).
I'll look into it tomorrow and try and see what's happening. Probably tackle it in a new bug. Sigh.
Assignee | ||
Comment 16•5 years ago
|
||
Just uploading my hacky little speed-test script for future reference (I did it as a unit test to make it simple to run).
Assignee | ||
Comment 17•5 years ago
|
||
The missing-emails issue I stumbled upon yesterday turned out to be the separator line parsing being too strict. I'll fix that in another patch, but for now, here's a revised version of the main patch to fix the encodings.
To recap, the conversion is about half the speed as the old code (which erroneously used the native TextEncoder/TextDecoder, screwing up some content). For a one-off operation, I can live with that, but happy to look at it afterwards (in a followup bug maybe?). But for now I'd like to be moving on to other stuff.
To be honest, I'd ideally like to ditch this code and do such conversions using the C++ pluggable mbox/maildir implementations rather than maintain these (but there are a few threading & gui issues that'd need to be worked through, so not trivial). I'm pretty sure going that route would be a lot quicker (it should be an IO bound operation).
Magnus: you're my default reviewer on this patch, but an r? flag will have to wait until your bugzilla account is sorted :-)
Assignee | ||
Comment 18•5 years ago
|
||
And this one loosens the overly-rigid "From " matching, as per Comment 4.
This was the problem I encountered in Comment 15. It was tripping up on my faked-up mbox data, which had slight date-formatting differences to what was expected (spaces vs leading zeros for single-digit days).
(The overly-rigid regexp was used to distinguish real "From " lines from non-escaped "From " lines in the message body, but since then the much better check-for-a-mail-header-on-the-next-line hack was added).
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Updated•5 years ago
|
Comment 19•5 years ago
|
||
Comment 20•5 years ago
|
||
Assignee | ||
Comment 21•5 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #20)
- let bytesToString = function(bytes) {
Maybe it should just be
bytes.map(b => String.fromCharCode(b)).join("")
No, it doesn't work. map()
returns another UInt8Array, so the join() doesn't work properly (you just end up with a rubbish string of digits). I did change bytesToString
to accumulate the string using reduce
, as in:
return bytes.reduce(function(str, b) { return str + String.fromCharCode(b); }, "");
that works, but isn't any faster.
- // We know that str is really raw 8-bit data, not UTF-16. So we can
Hmm, what does happen with UTF-16? Test case at
https://searchfox.org/comm-central/source/mail/test/browser/composition/data/
body-utf16.eml - maybe you can add that to the tests?
No, it really is raw 8-bit data. The whole point is to treat all incoming data as bytes, regardless of encoding, and write it back out verbatim (and split by the "From " lines). The only reason we want it in string form is to run the "From " regex, which is waaay quicker than searching the raw bytes.
Assignee | ||
Comment 22•5 years ago
|
||
And this update includes the fix for the broken unit test.
I'd screwed up the regexp. Doh.
Comment 23•5 years ago
|
||
Updated•5 years ago
|
Updated•5 years ago
|
Comment 24•5 years ago
|
||
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/b8cc5965aca0
Make mbox->maildir conversion 8-bit safe. r=mkmelin
https://hg.mozilla.org/comm-central/rev/4a0bd51ee9c0
Loosen restrictive "From "-line matching during mbox->maildir conversions. r=mkmelin
Description
•