Closed Bug 954168 Opened 11 years ago Closed 11 years ago

The Forest database proposal

Categories

(Instantbird Graveyard :: Other, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: bugzilla, Unassigned)

References

Details

Attachments

(6 files)

*** Original post on bio 734 by Paul [sabret00the] <sabret00the AT yahoo.co.uk> at 2011-03-21 14:36:00 UTC *** This post provides an overview of the overall database design proposal of the tentatively dubbed Forest system. Forest[2] is designed to be a complete replacement for the Instanbird chat logs and history systems using Storage[1] ib_forest * id * protocol_id * preview_text * chat_count * user_id * avatar_id * frequency ib_threads * id * protocol_id * user_id * avatar_id * dateAdded * lastModified ib_posts * id * protocol_id * user_id * avatar_id * post * tag_id * favourite ib_protocols * id * name ib_users * id * name ib_avatars * id * avatardata ib_tags * id * tag ib_chathistory * id * protocol_id * timestamp * session ib_alerts * id * type * timestamp * read [1] https://developer.mozilla.org/en/Storage [2] The named Forest is proposed for the simple fact that it suites the program given it's name. Being the core of the back-end logging functionality, it's not strange to refer to the database as the program's home. And with it's ability to extensibility, it's more than what a nest is, hence opting for a more encompassing Forest.
*** Original post on bio 734 at 2011-03-21 14:44:57 UTC *** Could you provide a brief description of what each of these tables are and what each column refers to? Not all of them are clear from the column identifier. Also some things are already stored somewhere in Instantbird (either in the prefs are in a SQL database), so is it necessary to store them again? > ib_protocols > * id > * name > > ib_users > * id > * name > > ib_avatars > * id > * avatardata > > ib_tags > * id > * tag
*** Original post on bio 734 by Paul [sabret00the] <sabret00the AT yahoo.co.uk> at 2011-03-21 15:17:09 UTC *** (In reply to comment #1) > Could you provide a brief description of what each of these tables are and what > each column refers to? Not all of them are clear from the column identifier. ib_forest is essentially the core table of the database. In this table we essentially store all of the information we're likely to want at a glance. * id | every record needs a unique identifier. * protocol_id | what protocol was used at the creation of the record. * preview_text | preview text is a column that's updated based on the last . thing that said within a conversation. It enables the ability to scan through a chat log. * chat_count | how many conversations you'd have with said person. * user_id | the userid of the person you're talking to. * avatar_id | allows the quick pulling of an avatar, negating the dependency on an online connection. * frequency | the frequency of the conversation. This is a calculation that's done at the time each record is stored/updated based on other tables. ib_threads is where we store the main threads and information specific to them. * id * protocol_id * user_id * avatar_id | this column would be updated at the end of every conversation. * dateAdded | this is the creation date of the table. * lastModified | this is the last update date of the table. ib_posts is the actual data from the conversations, this is essentially where every conversation is stored. the conversation is broken down either match open/closed windows or long lulls in conversation (2hrs without anyone speaking). * id * protocol_id * user_id * avatar_id * post | is the actual text of the conversation. * tag_id | this would enable users to tag conversations * favourite | this would enable users to mark a conversation as a favourite. ib_protocols is the storage of protocols. It's biggest benefit is that you enable a user to search for a conversation they had on MSN and greatly reduce the impact of the search * id * name | name of the network/protocol ib_users is where the usernames of people the user has spoken to is stored. This especially gets it's own table as a user may change their username * id | unique id to track the id number * userid | to track user number * name | to track the current name ib_avatars stores the data of a picture used by an avatar. * id * avatardata ib_tags enables the quick scanning of tags * id * tag | text ib_chathistory helps to sync the frequency of the chats. When used with the threads table, can provide frequency in the least expensive manner. * id * protocol_id * timestamp * session ib_alerts tracks alerts that have been sent to the user. This enables a user to track alerts too. * id * type | this has been designed to act like a switch, but it may be better to reference another table via a typeid. * timestamp * read | whether the alert has been marked as read or not. > Also some things are already stored somewhere in Instantbird (either in the > prefs are in a SQL database), so is it necessary to store them again? Fundamentally yes. We need to keep everything in one place to allow the maximal speed of queries. If we're linking across databases it does not work out well for us. Especially for things like userids which differ across protocols. We will of course have to make sure that the integration is seamless and dependable. But we certainly need the information local to the database to minimise the strain on the database.
Blocks: 954170
Attached image Proposed Table Relationship Overview (deleted) —
*** Original post on bio 734 as attmnt 608 by nexxuscommand AT yahoo.com at 2011-05-05 00:08:00 UTC *** If I am understanding the relationship correctly, there might be some more optimizations in the normalization process and making it more flexible. Case in point I would not make a distinction between an alert and a normal post. All an alert is, is a post by the system (which should be user id 0)to the end user. It does not needs its own separate section. I will need to do some research on all the variables, but this is a good start.
Attached image Current 0.3a2 schema overview (deleted) —
*** Original post on bio 734 as attmnt 609 by nexxuscommand AT yahoo.com at 2011-05-05 00:59:00 UTC was without comment, so any subsequent comment numbers will be shifted ***
*** Original post on bio 734 by nexxuscommand AT yahoo.com at 2011-05-06 18:17:26 UTC *** After looking over the required fields, I think I have an idea on all that's required. However the DB structure can be a little tricky. We can handle this either programmatically, relying on the program to handle errors and or conflicts, or structurally I can make it impossible create a conflict. The main difference is simply in the number of tables required to impose referential integrity i.e. the elimination of many to many relationships. Also for future proofing we should reserve the first 1000 fields so any auto numeration starts at 1000 and then adds.
*** Original post on bio 734 by Paul [sabret00the] <sabret00the AT yahoo.co.uk> at 2011-05-06 18:34:16 UTC *** I think it's important to bare in mind that we need extensibility and stability. We ideally want to learn from the mistakes and changes made on the places database rather than simply repeating their mistakes. This will ultimately be the core of the program and thus we need to separate the data as best as possible. Regarding the separation of alerts versus treating them as posts, I'm firmly behind that idea given the sheer versatility of alerts. Loading up a table that has so many rows just to provide alerts at a glance will hinder performance.
*** Original post on bio 734 by nexxuscommand AT yahoo.com at 2011-05-06 20:25:45 UTC *** Extensibility and stability come from the normalization of the data and the structure the database. I have to disagree with you on the alerts; there is no technical difference between a notification and message. To keep my design it simple I have a session table that has an ID, start and stop times. Then have an enforced referential check between that and the messages which again is kept simple by only including the auto_interger, account_id of sender/receiver, the message, timestamp, timeoffset, and session_id. Returning all messages by session and then the order is was stored in the database. SELECT * FROM ib_messages ORDER BY session_id DESC, auto_interger DESC For just a particular thread with all users. SELECT * FROM ib_messages WHERE session_id = 1 ORDER BY auto_interger DESC To show a thread and exclude message/user would be SELECT * FROM ib_messages WHERE session_id = 1 AND account_id IS NOT 1 ORDER BY auto_interger DESC All of which is very simple SQL logic. You want to count all the posts in a thread you don't need to store that number either SELECT COUNT(*) AS count FROM ib_messages WHERE session_id = 1 AND account_id IS NOT 1 ORDER BY auto_interger DESC Returns the number posts, this is in no way taxing on a SQL system. Ironically I would possibility break out the alerts for a completely different reason, manly the alerts themselves can be normalized to remove redundancy then again to get those back into a message you would need to do a JOIN statement As long as the data is normalized it can always be shifted at a later date, it's just a lot of the stored procedures would need to be reworked. If I have time this week I will build a test DB and put data in it to test.
Attached image Schema on the instantbird.sqlite file (deleted) —
*** Original post on bio 734 as attmnt 612 by nexxuscommand AT yahoo.com at 2011-05-07 04:54:00 UTC *** 608: Proposed Table Relationship Overview should be marked obsolete if the new data structure is accepted.
Attached file empty database (deleted) —
*** Original post on bio 734 as attmnt 613 by nexxuscommand AT yahoo.com at 2011-05-07 04:59:00 UTC *** Quick overview; ib_accounts = stores all basic and common account information, it is the primary table that all the other tables revolve around. ib_messages = all messages are stored in this simple table. This is also the only table that is fully indexed for searches. ib_session = keeps track of what messages belong to which sessions. ib_accounts_groups = required to allow multiple accounts to be grouped into a single object. ib_groups = parent container of accounts. ib_subgroups = needed to allow for subgrouping. Ib_prpl = libpurple Ib_icons = information for icons. Ib_status = status of account. Ib_sounds = sounds of status changes. I know this does not have all the fields, but it is a start. Off the top of my head there may be other links between ib_groups and ib_icons so groups can have predefined icons and also ib_groups and ib_status can be lined so the entire groups status can be changed. For file downloads, we should just use the downloads.sqlite for the Mozilla project, no sense in reinventing the wheel. The biggest obstacle is to come up with and elegant method to sync two or more db's which would resolve bug 953505 (bio 54)
Comment on attachment 8352356 [details] empty database *** Original change on bio 734 attmnt 613 by nexxuscommand AT yahoo.com at 2011-05-07 05:00:53 UTC was without comment, so any subsequent comment numbers will be shifted ***
Attachment #8352356 - Attachment description: attachable but empty db file → empty database
*** Original post on bio 734 by Paul [sabret00the] <sabret00the AT yahoo.co.uk> at 2011-05-08 10:14:13 UTC *** I just can't help but feel there's a lack of normalisation here. That is in fact why all forum software have a thread table and a post table. This is basic database design, thus saying we should poll our biggest table for alerts is nonsensical in the grand-scheme of things. Same goes for messages. Also ib_prpl needs to be renamed, it can either be purple or ib_purple but attempting to shorten it here makes absolutely no sense.
*** Original post on bio 734 by nexxuscommand AT yahoo.com at 2011-05-08 17:02:08 UTC *** That's why this is all here, for peer review. I had a discussion with Flo on the IRC and because the it looks like the goal is to keep the logs per account separate for each other. LogDB example + ib_accounts = (stripped down version with just the ID, account(key), alias_local, would be in it + ib_message = keep the same fields + ib_session = keep the same fields Largely our designs are the same. The only reason why I keep going back to the ib_accounts table is because form my POV there is no technical distinction between a buddy and an account or even a blocked buddy, it's just a status change and in the case of an account its some extra fields not applicable to the generic buddy account info. Also for clarification on the group table and why it exists. The groups tab exists to create a logical relationship between accounts. The reason why this works is because there is no technical difference between say the general group of everyone, and a group of select accounts regardless of protocol. All it is in the end is a container for accounts. The subgroup would allow for nested grouping.
Attached file Quick and dirty messagelog (deleted) —
*** Original post on bio 734 as attmnt 615 by nexxuscommand AT yahoo.com at 2011-05-08 17:07:00 UTC *** After my discussion with Flo here is the quick and dirty message log design with populated data for the last three days of the IRC room. Note that this is independent of the blist/accounts but does copy over just the required accounts for the logs based upon the "clean" name of the account.
Attached image MessageLog Schema (deleted) —
*** Original post on bio 734 as attmnt 616 by nexxuscommand AT yahoo.com at 2011-05-09 07:06:00 UTC *** As an ongoing mental exercise to flush out any database changes before any potential implementation, I have a general question, regarding tags, favorites and unread messages. Is there any reason why they require different fields and cannot be defined under one general field? IMO it is nothing more than a reserved name for a standard action. Attached is an updated copy the messagelog schema. As always ask if there are any questions/comments.
*** Original post on bio 734 by Paul [sabret00the] <sabret00the AT yahoo.co.uk> at 2011-05-09 09:37:55 UTC *** (In reply to comment #13) > Created an attachment (id=616) [details] > MessageLog Schema > > As an ongoing mental exercise to flush out any database changes before any > potential implementation, I have a general question, regarding tags, favorites > and unread messages. Is there any reason why they require different fields and > cannot be defined under one general field? IMO it is nothing more than a > reserved name for a standard action. > > Attached is an updated copy the messagelog schema. As always ask if there are > any questions/comments. Hehe, you're like anti-normalisation. That said, I thought about this some more last night, if we're hoping to sync logs (and potentially conversations) across platforms. We're going to have to really think on the largest scale and thus maximise normalisation here. We ultimately want to make sure that data can be broken up as much as possible and then synchronised everything. I would certainly hope that before any decision is taken on what direction to go with schema everyone involved in the decision and putting forward proposals has taken time out to read the information available for the Mozilla Places DB and Large scale DB Design in general.
*** Original post on bio 734 at 2011-05-09 11:20:14 UTC *** (In reply to comment #14) > I would certainly hope that before any decision is taken on what direction to > go with schema everyone involved in the decision and putting forward proposals > has taken time out to read the information available for the Mozilla Places DB If you are referring to a specific document here, please link to it :). I think the various points we made while discussing on IRC the indexed logs Google Summer of Code proposal would be valuable in this context. I'm not sure if it's been summarized somewhere or if all we have is the IRC log.
*** Original post on bio 734 by nexxuscommand AT yahoo.com at 2011-05-09 16:20:33 UTC *** (In reply to comment #14) > Hehe, you're like anti-normalisation. > > That said, I thought about this some more last night, if we're hoping to sync > logs (and potentially conversations) across platforms. We're going to have to > really think on the largest scale and thus maximise normalisation here. We > ultimately want to make sure that data can be broken up as much as possible and > then synchronised everything. I see a lot of talk but not a lot of action. If you have a better schema perhaps a visual or some more input other then snide remarks about others. I was instructed to make the logs independent of the account data. As such there needs to be some soft of common factor between two independent database that have some sort of common key, and not an arbitrary id. hence the denormalisation of that part of the database. We could normalize the individual messages, sure, unfortunately as the DB grows the amount of time to enter the data grows and takes longer to do the insert check. At some point you will waste more time checking a 100,000 line message for duplicate text then just entering the data and potentialy wasting space. How about normalizing the time offset. We could do that that was well, but it turns out that in sqlite the data field between auto numerate integer ID and an integer reserves the same amount of space, ironically the date field take up more space which is why in the Mozilla tables they stick with integers as well. I was a DBA for two years and being told that I am anti-normalisation is the first in fact quite the contrary in fact because that was among the things that I was told by my peers and seniors that I excelled at. We are all waiting for your work to help illuminant us. BTW a GUID per installation would fix the sync issues and that be in its own table.
*** Original post on bio 734 by Paul [sabret00the] <sabret00the AT yahoo.co.uk> at 2011-05-09 17:01:48 UTC *** Snide remarks? You mistake my intentions. I'm merely taking a back seat here attempting to critique your choices as to better them. Yes I do have some paltry db design experience and may have actually done a wee bit of a research before making my initial proposal, but ultimately I have enough humility to attempt to guide your overly simplistic choices to something more appropriate and scalable. I apologise if that offends you. However I refuse to apologise for my assumption that you did indeed research or at least would upon prompting read the MDN's page on the places database and the related blogs that populated Planet Mozilla while the design was undergoing an overhaul.
*** Original post on bio 734 at 2011-05-09 17:02:42 UTC *** I've been reading through this bug (and just did so again), but I'm not sure of what it is fully trying to accomplish. There's talk of a DB schema to store "stuff" in, and a lot of discussion has gone into it. But what issue/problem/bug/enhancement is this trying to make? What is going to be the "feature" that will leverage all of this dicussion? I do not see one, and perhaps there is one, but it just hasn't fully been shown. If there is not one this should be RESOLVED INVALID. I think this is attempting to work on the "indexed logs" idea, but I'm not really sure of that. If it is...a list of requirements for that feature need to be discussed before discussion of the implementation.
*** Original post on bio 734 at 2011-05-09 17:04:41 UTC *** The IRC discussion I was referring to in comment 15 seems to start at http://log.bezut.info/instantbird/110404#m192 A very long while ago, we discussed log format and the notes we took are at https://wiki.instantbird.org/Brainstorm:logs (some of the requirements listed there are incompatible, so it's OK to ignore some like for example 'Easy to use with basic tools (grep, etc...)'). I'm extremely confused by the discussion in this bug, to the point that I don't even know what to ask more info about. I feel that valuable energy is being wasted here and there's an increasing likeliness of someone being seriously disappointed at some point. I think it would help if everybody involved here could explain what's the desired outcome of the discussed changes. What's the issue you are trying to solve exactly? Personally, I don't remember expressing an intention to attempt to put 'everything into a database'. The issues I'm trying to solve are: - our current log format (Pidgin's plain text files) looses a lot of data, to the point that we can't redisplay a conversation from the log. - I would like to be able to search for a string across the whole archives in almost constant time (so I want to find a way to index the logs). - I would like to be able to synchronize and merge the data across several instances of Instantbird, and with a remote server. - I would like to reuse as much as possible of the existing software from Mozilla (Sync, Gloda, ...). I hope this helps. If it doesn't, please tell me how I can be more helpful here!
*** Original post on bio 734 by Paul [sabret00the] <sabret00the AT yahoo.co.uk> at 2011-05-09 21:52:35 UTC *** First of all, please let me state for the record. I put forward a proposal and as such am very interested in seeing a proposal come to fruition. I have in no way ever assumed that by doing so it would mirror my proposal or if it was to even come to fruition at all. As is clearly visible, I made the proposal two months ago and assumed when flo had some time he'd either mark as NEW or WONTFIX. While I believe in the fundamental importance of the proposal I'm certainly not of the opinion that if this doesn't happen I'll be angered in anyway. I took a liking to a project and as such have aspirations for the project. I'm very vocal in my belief that Instantbird has the potential to be the best multi-protocol client around, bettering both Pidgin and Trillian who I see as direct competition. (In reply to comment #18) > what it is fully trying to accomplish. There's talk of a DB schema to store > "stuff" in, and a lot of discussion has gone into it. But what > issue/problem/bug/enhancement is this trying to make? What is going to be the > "feature" that will leverage all of this dicussion? Features this bug lays the foundation to provide: * Encrypted storage of logs, alerts and accounts * Searchable logs and alerts * Synchronisation of logs, alerts and accounts (In reply to comment #19) > Personally, I don't remember expressing an intention to attempt to put > 'everything into a database'. The issues I'm trying to solve are: You indeed did not express such an intention. However having taken an overview of the project and the things that are likely to change or could potentially change, like switching from libpurple to pure javascript. I made a proposal that I felt held the capacity to absorb any such changes. > - our current log format (Pidgin's plain text files) looses a lot of data, to > the point that we can't redisplay a conversation from the log. > - I would like to be able to search for a string across the whole archives in > almost constant time (so I want to find a way to index the logs). > - I would like to be able to synchronize and merge the data across several > instances of Instantbird, and with a remote server. > - I would like to reuse as much as possible of the existing software from > Mozilla (Sync, Gloda, ...). All things I took into consideration explicitly. Just to reiterate again, this is a project a greatly believe in, but I'm of no belief that everything I want for Instantbird should explicitly happen or will happen. I merely put forward ideas as bugs and see where it goes from there.
*** Original post on bio 734 at 2011-07-08 14:18:01 UTC *** There is no clearly identified issue that this bug is trying to solve. If there is one that we cannot see, please file a new bug about the issue, then a proposed solution.
Status: UNCONFIRMED → RESOLVED
Closed: 11 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: