1572000 - [meta] database backed global message index

Assignee

Description

5 years ago

Currently folder indexing uses Mork, and the message index is per folder. Mork removal has been wanted for a long time (bug 453975), but when working on removing it from usage for the folder index we should also take the opportunity to design it correctly to use a global message index.

This would enable a conversation view of messages (which currently requires gloda, and gloda is not meant for that).

It would make issues like bug 43278 go away.

I believe that for Gmail we're downloading the same messages multiple times, because we don't know they're already in All Mail (duh!).
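
A minimal sketch of how a global index could avoid the duplicate download, assuming copies of a message can be matched across folders by their Message-ID header. The class and method names are hypothetical, not Thunderbird APIs.

```python
from collections import defaultdict


class GlobalMessageIndex:
    """Toy global index: Message-ID -> set of folders holding a copy."""

    def __init__(self):
        self._locations = defaultdict(set)

    def add(self, message_id, folder):
        self._locations[message_id].add(folder)

    def needs_download(self, message_id):
        # A message already known in any folder does not need fetching again.
        return message_id not in self._locations

    def folders_for(self, message_id):
        return sorted(self._locations.get(message_id, ()))


index = GlobalMessageIndex()
index.add("<abc@example.com>", "[Gmail]/All Mail")

# The same message later shows up in another folder: record the new
# location instead of downloading the content a second time.
if not index.needs_download("<abc@example.com>"):
    index.add("<abc@example.com>", "INBOX")
```

With a per-folder index there is no place to ask the `needs_download` question, which is the point being made above.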

We need to figure out how storage of the actual message data should be handled: put it in the database, or keep it on the file system, or a combination where normalized/decoded content would be put into the database for quick searching and indexing and the raw data would be kept only for backup.

Primarily I think we should target IndexedDB for database, since that is the web thing to do.

Magnus Melin [:mkmelin]

Assignee

Updated

5 years ago

Blocks: 453975

Magnus Melin [:mkmelin]

Assignee

Updated

5 years ago

Priority: -- → P3

Jorg K (CEST = GMT+2)

Comment 1

5 years ago

Any reading material on the IndexedDB?

Magnus Melin [:mkmelin]

Assignee

Comment 2

5 years ago

https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API is one. If you want the spec, see http://w3c.github.io/IndexedDB/

In a nutshell, if you need a database and want to be using web technologies, this is what you need to use. Amongst the more important features, it can be used from Web Workers, so for instance you can have background workers fetching your mail from the server and putting it into the database, with no need to steal processing time from the UI thread. For all other solutions (like sqlite) you'd have to jump through multiple hoops to get that working in any kind of hacky way (if at all).

Jorg K (CEST = GMT+2)

Comment 3

5 years ago

Do you plan to do a prototype to check the performance of the new solution? Mork is old and ugly, but it can handle huge folders in a reasonable time.

Magnus Melin [:mkmelin]

Assignee

Comment 4

5 years ago

Yes we would have to ensure the performance is acceptable.
A complete one-to-one comparison to the current state will be more difficult since the current implementation isn't global, but if you keep it to one folder only for comparisons, that should work.

Wayne Mery (:wsmwk)

Updated

5 years ago

Depends on: 11050

Eyal Rozenberg

Comment 5

5 years ago

This is at the same time a good idea and a totally wrong idea.

Indeed, we do need to be able to index all messages.
And we do need to be using a Database.

But - the entire separation between "the messages", "an index for the messages" and "a backing database" with Thunderbird marrying them together or using one for the other - is myopic.

Messages should simply go in a message database. That's it. Enough said, full stop.

No more saying, "Oh, we need search ability X" or "indexing ability Y" - we just need a DBMS, a software system which has all that stuff. You put your messages in, and you're done (up to having effective access to its contents and facilities). I'm not saying existing document-oriented DBMSes have all the features we need - maybe they're missing some - but we definitely need lots and lots of what they already have.

Ben Campbell

Comment 6

5 years ago

I've been thinking about this quite a lot (in the context of folders and of pluggable mailstore), and here are my thoughts so far:

The very first step should be to document all the things that are stored in the messagedb.
There are a lot of protocol-specific and implementation-specific things stored in there.
For example:

mbox mailstores stash offsets to each message within the mbox file.
Maildir mailstores stash filenames.
IMAP folders stash the server-side name of the folder.
etc...

So I think the first chunk of work is to audit exactly what's going into the DB and to document it. I'm making a start on this now, noting down uses as I encounter them.

This is an issue when doing things like moving folders: e.g. what needs to be done to a message database when you move a folder from a local account to an IMAP account, say? Are there paths in the db that need to be patched up? What needs to be done with child folders? Can we even copy and reuse the db or does it need to be rebuilt from scratch? At the moment there's a whole heap of ad-hoc and slightly voodoo rules scattered around, and I think there are a lot of tricky little bugs stemming from this.

Beyond that...

Ideally it'd be good if we could tighten up the message-centric DB API (currently nsIMsgDatabase). It currently exposes a lot of general-purpose DB access. So it can be hard to track what stuff the various different systems are stashing in there.

I'm currently leaning toward the idea that the database should be owned by the PluggableMailstore. That's the bit which deals with the filesystem, and is already pretty co-dependent with the message database.
If there is a tighter messagedb API to be had, then it'd be nice if the boundary was at PluggableMailstore. This means we could be DB-agnostic, in the same way the messages themselves are already mailstore-agnostic.
And, as Eyal says, we gain the option to implement mailstores which use a unified database for both messages and their metadata.

Eyal Rozenberg

Comment 7

5 years ago

@BenCampbell : I believe you are (and some of our code is) conflating a message database with a message store. You write that "the message-centric DB API" we currently use is nsIMsgDatabase. But that API (ignoring its mix of higher-level and lower-level features) uses a specific schema with specific fields or flags one can set, and allows for very few of the features of a full-fledged DB: the ability to accept and execute/apply phrases in languages for data definition, data control, data manipulation and data querying - with these languages being reasonably expressive. That does not mean the API should be textual query/command-based, but nsIMsgDatabase certainly doesn't cover it. We obviously have some implicit or explicit APIs for accessing our message "database":

Virtual search folders - materialized views. [2]
The folder tree in the 3-pane window - a kind of a nested materialized view
Message filters - stored procedures. [3]
Triggers for message filters - DB triggers [4]
A Bulk-insert mechanism - routinely downloading messages from POP3 servers (or message headers from IMAP servers) and adding them into the DB. For IMAP it's perhaps more a bulk-upsert mechanism.

... except that these mechanisms are not conceived in terms of a database.
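
To make the analogy concrete, here is a rough sketch of two of the items above expressed in database terms, using SQLite through Python's `sqlite3` module. The table layout, folder names and addresses are invented for illustration and are not Thunderbird's schema. Note that SQLite views are ordinary views, not materialized ones.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE messages (
    id      INTEGER PRIMARY KEY,
    folder  TEXT NOT NULL,
    sender  TEXT NOT NULL,
    subject TEXT NOT NULL,
    flagged INTEGER NOT NULL DEFAULT 0
);

-- A "virtual search folder" expressed as a view over all folders.
CREATE VIEW starred AS
    SELECT id, folder, subject FROM messages WHERE flagged = 1;

-- A "message filter" expressed as a trigger that runs on arrival.
CREATE TRIGGER file_newsletters AFTER INSERT ON messages
WHEN NEW.sender = 'news@example.com'
BEGIN
    UPDATE messages SET folder = 'Newsletters' WHERE id = NEW.id;
END;
""")

# Bulk insert, as when a batch of new mail arrives.
con.executemany(
    "INSERT INTO messages (folder, sender, subject, flagged) VALUES (?, ?, ?, ?)",
    [
        ("Inbox", "alice@example.com", "Lunch?", 1),
        ("Inbox", "news@example.com", "Weekly digest", 0),
        ("Archive", "bob@example.com", "Contract", 1),
    ],
)

starred = con.execute("SELECT folder, subject FROM starred ORDER BY id").fetchall()
newsletter_folder = con.execute(
    "SELECT folder FROM messages WHERE sender = 'news@example.com'"
).fetchone()[0]
```

Here `starred` spans folders without any per-folder bookkeeping, and the newsletter ends up filed by the trigger rather than by application code.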

For this reason I don't think it's important to be DBMS agnostic [1] - the agnosticism we need is in support for different schemes of message storage (maildir, mork/mdb, IMAP). I may also need to qualify what I wrote earlier about document-oriented DBMSes: It appears most of them expect you to have contents in some format that they like, such as JSON or XML etc - while we have the complex MIME structure, with nested multiparts and alternative-parts. So it may end up being the case that - without implementing a complete DBMS - it could be better to use a relational DB which supports binary blobs, with MIME parts being such blobs; or maybe not, I'm not sure.
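
For what the "relational DB with MIME parts as blobs" option might look like, here is a small sketch using Python's standard `email` and `sqlite3` modules. The `parts` table and its columns are assumptions made up for this example, not a proposed schema.

```python
import sqlite3
from email.message import EmailMessage

MESSAGE_ID = "<1@example.com>"

# Build a small multipart/alternative message to store.
msg = EmailMessage()
msg["Message-ID"] = MESSAGE_ID
msg["Subject"] = "Hello"
msg.set_content("plain body")
msg.add_alternative("<p>html body</p>", subtype="html")

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE parts ("
    " message_id TEXT, part_no INTEGER, content_type TEXT, body BLOB,"
    " PRIMARY KEY (message_id, part_no))"
)

# walk() yields the container first, then each nested part in document order.
for part_no, part in enumerate(msg.walk()):
    # Containers have no body of their own; leaf parts are stored decoded.
    payload = None if part.is_multipart() else part.get_payload(decode=True)
    con.execute(
        "INSERT INTO parts VALUES (?, ?, ?, ?)",
        (MESSAGE_ID, part_no, part.get_content_type(), payload),
    )

rows = con.execute(
    "SELECT part_no, content_type, body FROM parts ORDER BY part_no"
).fetchall()
```

This flattens the MIME tree into numbered rows; a real design would also need to record parent/child relationships between parts, which this sketch leaves out.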

As for IndexedDB - I'm giving it the once-over, and it's not clear to me that it's sufficiently expressive to reduce enough of what TB is doing to operations through that interface. (But maybe it is and I just need to read through that document).

[1] - DB = the tables, the columns, their types, the constraints, and the data in the tables. DBMS = The software system for creating, managing, altering and querying DBs.
[2] - https://en.wikipedia.org/wiki/Materialized_view
[3] - https://en.wikipedia.org/wiki/Stored_procedure
[4] - https://en.wikipedia.org/wiki/Database_trigger

Zbigniew Gralewski

Comment 8

4 years ago

Guys, please move index and cache and other "not config" files OUT OF profile folder.
Please move it to something like %localappdata% etc.

Leaving only config files in the profile folder would allow synchronizing the whole folder in realtime using Dropbox, Google Drive etc.
What I mean is - we want to put the profile folder under Dropbox or Google Drive in order to have realtime backup.

Cache, db and index files only pollute the profile folder and make Thunderbird totally unusable in terms of realtime profile protection.

Alfred Peters

Comment 9

3 years ago

Should this "Global Database" also contain the newsgroup articles? Or should these articles be stored in a separate database?

Just my (1(1¢)

Blocks: 43278

Alfred Peters

Comment 10

3 years ago

(In reply to Magnus Melin [:mkmelin] from comment #0)

It would make issues like bug 43278 go away.

OK, that's the answer. Never mind.

Thomas D. (:thomas8)

Updated

3 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1717113

BMO Automation

Updated

2 years ago

Severity: normal → S3

Magnus Melin [:mkmelin]

Assignee

Comment 11

2 years ago

I've been working on this. I'll soon upload a WIP patch. While it works, it's not really ready for too much feedback yet and many things can and will change.

In a nutshell, the plan is to initialize the message index data from the Mork (.msf) files of the folders, and store this in a database of all messages (all message metadata). After that, keep the new database up to date with what gets written to the .msf files. This dual-write is intended to be a temporary measure; having the new database fully replace Mork is a second step. That second step requires a lot of work, so we can have the dual system in place soon, to at least use for conversations until we're ready to fully switch over to the new database.
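
The two phases described above (seed from the per-folder data, then mirror every write) can be sketched roughly as follows. This is not the code from the patch: plain dicts stand in for both the .msf files and the new database, and all names are hypothetical.

```python
class DualWriteIndex:
    """Writes go to the legacy per-folder store and to the global store."""

    def __init__(self, legacy_folders):
        # legacy_folders: {folder_name: {msg_key: metadata_dict}}
        self.legacy = legacy_folders
        self.global_index = {}
        self._import_legacy()

    def _import_legacy(self):
        # Phase 1: seed the global store from the existing per-folder data.
        for folder, messages in self.legacy.items():
            for key, meta in messages.items():
                self.global_index[(folder, key)] = dict(meta, folder=folder)

    def add_message(self, folder, key, meta):
        # Phase 2: every later write lands in both stores.
        self.legacy.setdefault(folder, {})[key] = dict(meta)
        self.global_index[(folder, key)] = dict(meta, folder=folder)

    def remove_message(self, folder, key):
        self.legacy.get(folder, {}).pop(key, None)
        self.global_index.pop((folder, key), None)


index = DualWriteIndex({"Inbox": {1: {"subject": "Hi"}}})
index.add_message("Sent", 7, {"subject": "Re: Hi"})
```

Once nothing reads from `legacy` any more, the second step amounts to dropping that half of each method.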

This enables us to display conversations spanning multiple folders, which is usually the case if you reply to something.

Assignee: nobody → mkmelin+mozilla

Status: NEW → ASSIGNED

Priority: P3 → P1

Mark Banner (:standard8)

Comment 12

2 years ago

Is there a design spec/document for the new database? That would help with providing feedback without everyone having to read the patch/code - and also provide documentation for future maintainers.

Magnus Melin [:mkmelin]

Assignee

Comment 13

2 years ago

Attached file WIP: Bug 1572000 - database backed global message index. (deleted)

Magnus Melin [:mkmelin]

Assignee

Comment 14

2 years ago

This is storing the message meta in IndexedDB. As an object store, what is stored is following what the object looks like. Basically, the data is what we have at the moment. We store the message uri, and use that as key. Other additional things that need to be stored are folder and root (= which account the folder is in) since e.g. if you have correspondence between accounts those shouldn't be in the same conversation when you view it. Additionally, it does seem like we need to assign and store a threadId.
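
A rough in-memory sketch of such a record and of a conversation lookup that stays within one account. The field names, the URI format and the rule for assigning a thread id (reuse the thread of a referenced message in the same root) are assumptions for illustration, not what the patch does.

```python
from dataclasses import dataclass


@dataclass
class MessageRecord:
    uri: str              # used as the key
    folder: str
    root: str             # account the folder belongs to
    message_id: str
    references: tuple = ()
    thread_id: int = 0


class InMemoryStore:
    def __init__(self):
        self.by_uri = {}
        self._next_thread_id = 1

    def put(self, record):
        # Reuse the thread of any referenced message in the same account,
        # otherwise start a new thread.
        for other in self.by_uri.values():
            if other.root == record.root and other.message_id in record.references:
                record.thread_id = other.thread_id
                break
        else:
            record.thread_id = self._next_thread_id
            self._next_thread_id += 1
        self.by_uri[record.uri] = record

    def conversation(self, uri):
        seed = self.by_uri[uri]
        return [
            r.uri for r in self.by_uri.values()
            if r.root == seed.root and r.thread_id == seed.thread_id
        ]


store = InMemoryStore()
store.put(MessageRecord("mailbox://a/Inbox#1", "Inbox", "a", "<1@x>"))
store.put(MessageRecord("mailbox://a/Sent#5", "Sent", "a", "<2@x>", ("<1@x>",)))
# The same reply as seen from a second account: kept in its own thread.
store.put(MessageRecord("mailbox://b/Inbox#9", "Inbox", "b", "<2@x>", ("<1@x>",)))
```

The conversation for the Inbox message in account `a` includes the reply in Sent, while the copy in account `b` stays separate.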

The patch has an in-memory db implementation as well, as it was more convenient to experiment with that. Switching to the IndexedDB implementation happens by changing the store at the bottom of MessageIndex.jsm

Note that all the UI changes in the patch are only to ease debugging and try things out.

Magnus Melin [:mkmelin]

Assignee

Updated

2 years ago

Blocks: 1686504

Magnus Melin [:mkmelin]

Assignee

Updated

2 years ago

Depends on: 1635340

Mark Banner (:standard8)

Comment 15

2 years ago

(In reply to Magnus Melin [:mkmelin] from comment #14)

This is storing the message meta in IndexedDB. As an object store, what is stored is following what the object looks like. Basically, the data is what we have at the moment.

From what I can tell, the current data set may be too limiting and doesn't fix some of the overall issues with the existing message system.

For example, one of the current issues for add-ons is that if you want to access the headers for the messages, then you end up having to stream the whole message which, of course, is expensive when you need to do it for every message in a folder. Gloda gives you access to this in some form via its defineAttribute mechanism, that's probably not exactly what we want here, but having quick access to the headers (or maybe a defined selection) would help.

If we're not going to want to store all or some of the headers in here (which we might not want to anyway), then we either need to come up with an alternative mechanism which avoids having to stream messages every time, or link the new database to Gloda, and properly support Gloda at an add-on level.

Another thing I can't tell if you're planning on doing or not, is that I'm pretty sure Gloda has some form of "conversation detection" which fixes up some of the broken conversations that come in. Presumably that would be useful to port across to this database as well.

Gloda's message snippet may also be useful in the context of having a multiple-line message tree - as then you wouldn't need to stream each message to get the snippet (maybe the tree is already using Gloda for that though?.. but what about non-indexed folders/messages?).

I think this is what I'm currently lacking - the context around a clear description of how Gloda & the new database interact, the responsibilities of each, and how add-ons might be able to benefit from the new system (where the current Gloda system has limitations). I realise ideals might develop over time, but I think it would be better if these are being thought about up front, rather than trying to bolt things on after the main implementation is done.

Magnus Melin [:mkmelin]

Assignee

Comment 16

2 years ago

Thanks for the comments! I've been focused on getting the data we already have into usable form for conversational view so far, so mainly global threading. It might indeed be good to store more data about each message: potentially all of the headers in parsed format, and also a text and html representation of the message content in parsed ready-to-display format (for display and search). Of course, this part will require an initial going back and re-parsing the world to obtain that data, so super expensive. At least for the random headers, probably they should have their own store linked to the message since we don't need them usually.
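
As a sketch of the "own store linked to the message" idea, the following splits parsed headers into a small core record and a separate store for everything else, both keyed by the same message URI. The choice of core fields and the URI are made up for the example; parsing uses Python's standard `email.parser`.

```python
from email.parser import HeaderParser

RAW = (
    "From: alice@example.com\n"
    "Subject: Lunch?\n"
    "X-Mailer: ExampleMailer 1.0\n"
    "List-Id: <dev.example.com>\n"
    "\n"
    "body is ignored here\n"
)

CORE_FIELDS = {"from", "subject"}

messages = {}        # uri -> frequently used metadata
extra_headers = {}   # uri -> all remaining headers, read only on demand


def index_message(uri, raw):
    parsed = HeaderParser().parsestr(raw)
    core, extra = {}, []
    for name, value in parsed.items():
        if name.lower() in CORE_FIELDS:
            core[name.lower()] = value
        else:
            extra.append((name, value))
    messages[uri] = core
    extra_headers[uri] = extra


index_message("mailbox://a/Inbox#1", RAW)
```

Listing a folder then only touches `messages`, and an add-on that wants an arbitrary header can look it up in `extra_headers` without streaming the message.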

Going forwards it could be possible to also add a store for actually storing the raw messages. That's not something I'm looking at atm.

Re Gloda, the gloda-id is in the .msf so that property would be available, and from that one would be able to link into Gloda when that's available, I assume? I guess there is hope to make Gloda obsolete, though that is a large undertaking in itself, depending a bit on how much functionality would have to be 1:1.

Magnus Melin [:mkmelin]

Assignee

Updated

2 years ago

Depends on: 1798241

Magnus Melin [:mkmelin]

Assignee

Updated

2 years ago

Depends on: 1709521

Ben Campbell

Comment 17

2 years ago

Just a quick pie-in-the-sky thought I just had:

Currently, folders use a nsIMsgDatabase object, which is implemented as its own, separate-per-folder database.
With a global message database, each folder is effectively presenting a view upon that global DB - just the messages which are in that folder.
We've already got an nsIMsgDBView interface, which serves a pretty similar purpose.
Would it be reasonable to ultimately aim at unifying the nsIMsgDatabase and nsIMsgDBView interfaces?
At the moment they're miles apart, but conceptually they feel like they're really doing the same kind of thing: presenting a subset of messages from a larger pool.
I don't know much about the GUI side, but I'd bet there's some hoop-jumping currently going on to deal with the differences between nsIMsgDatabase and nsIMsgDBView and some major simplification to be had there somewhere...
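
A toy sketch of that idea: with one global pool of messages, "the messages in a folder" and "the messages matching a search" can both be the same kind of object, a filtered view. The class and function names are hypothetical and are not meant to mirror nsIMsgDatabase or nsIMsgDBView.

```python
class GlobalDB:
    """Toy stand-in for the global message database."""

    def __init__(self):
        self.rows = []

    def add(self, folder, subject):
        self.rows.append({"folder": folder, "subject": subject})


class MessageView:
    """A live, filtered view over the global database."""

    def __init__(self, db, predicate):
        self._db = db
        self._predicate = predicate

    def __iter__(self):
        return (row for row in self._db.rows if self._predicate(row))

    def __len__(self):
        return sum(1 for _ in self)

    def subjects(self):
        return [row["subject"] for row in self]


def folder_view(db, folder):
    return MessageView(db, lambda row: row["folder"] == folder)


def search_view(db, text):
    return MessageView(db, lambda row: text.lower() in row["subject"].lower())


db = GlobalDB()
db.add("Inbox", "Lunch?")
db.add("Archive", "Lunch menu")

inbox = folder_view(db, "Inbox")
lunch = search_view(db, "lunch")

# Views are evaluated lazily, so later additions show up automatically.
db.add("Inbox", "Contract")
```

Both `inbox` and `lunch` expose the same interface, which is the kind of unification being suggested above.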

Magnus Melin [:mkmelin]

Assignee

Updated

2 years ago

Depends on: 1801574

Ben Campbell

Updated

2 years ago

Depends on: 1802828

Ben Campbell

Updated

2 years ago

Depends on: 1806770

Alessandro Castellani [:aleca]

Updated

2 years ago

Blocks: 1819207