Closed Bug 383031 Opened 17 years ago Closed 13 years ago

register with VACUUM component how to shrink the urlclassifier.sqlite

Categories

(Toolkit :: Safe Browsing, defect)

x86
Windows XP
defect
Not set
normal

Tracking


RESOLVED WONTFIX
Tracking Status
blocking2.0 --- -

People

(Reporter: jo.hermans, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [i/o])

Attachments

(2 files, 1 obsolete file)

Today I was playing with a SQLite Database browser, to see what tables were inside the urlclassifier2.sqlite file. That browser could also perform a vacuum on the database file, and that's what I tried. To my surprise, the file shrank from 3.58MB to 1.95MB. That's quite a lot!

See also the discussion in bug 334010 (for Places), where the conclusion will probably be that a vacuum needs to be done when you clear the history. Auto-vacuum or a vacuum during shutdown might be too slow, and a history file will have roughly the same number of inserts as deletes in a single session anyway, so the extra space will be quickly reused during normal usage.

I'm not sure why the urlclassifier2 file was so large. Maybe because of what bug 346940 did with full-table updates? See also bug 347084.

Anyway, I was wondering where we could take the opportunity to vacuum the database. Since the database updates are completely automatic, and there's no way to clear or update the data manually, I don't see where this can be done. Maybe immediately after an update? Or in MaybeSwapTables()? Or maybe we can enable auto-vacuum only for this file?

I'm not sure how serious this is, but I think it's just good housekeeping to vacuum the database once in a while.
It can even be worse: I just tried to compact the urlclassifier2 file from my work profile (which is on Firefox 2.0.0.4), and it shrank from 10.289 KB to 2.011 KB.
Yeah, because the phishing URL list churns quickly (lots of inserts and deletes), it ends up being very fragmented.  I don't think this impacts performance significantly, but it does cause wasted space in the sqlite file.

Finding the optimal time to vacuum sounds hard since it's a single long-running (a few seconds) operation.  One idea we considered earlier was to have a separate database file for each table.  When doing a full table update, we'd create a new file with the new contents, and MaybeSwapTables would then copy the new file over the old file.  Unfortunately, this wouldn't help for the incremental update case.

I'm not sure which (full table swap or incremental inserts/deletes) is responsible for most of the file bloat.

Another idea we talked about was to vacuum urlclassifier2.sqlite when clearing the cache, even though it's not directly related, because there's some user expectation that clearing the cache may take a few seconds.

Oh, I think bug 347084 is a lie.  At the time I didn't know about the rename table command, but sqlite does support this and we are using it (it's also fast).
tony, for bug #385834 and bug #390244, our plan is to use 

PRAGMA auto_vacuum = 2 (incremental)

and call 

PRAGMA incremental_vacuum(N);

on an idle timer.

if that works out for us, that might be something worth doing for the url classifier.  (dcamp might have this same issue for the malware db?)
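
For illustration only, a minimal sketch of that pragma pair using Python's sqlite3 module (the real fix would go through mozStorage; the file name and the page count N are assumptions):

  import sqlite3

  conn = sqlite3.connect("urlclassifier3.sqlite")  # illustrative path

  # Incremental mode; note sqlite only applies an auto_vacuum change to an
  # existing database after a full VACUUM rewrites the file.
  conn.execute("PRAGMA auto_vacuum = 2")
  conn.execute("VACUUM")

  # Later, from an idle timer: release up to N free pages back to the OS.
  N = 100  # illustrative batch size
  conn.execute("PRAGMA incremental_vacuum(%d)" % N)
  conn.commit()
  conn.close()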
Now that the file name has changed, it should be 'urlclassifier3.sqlite' (note extra 'l' in current summary).
About comment #3: we are no longer doing auto_vacuum.

You don't want auto_vacuum or incremental vacuum (if disk space is cheap), as it causes fragmentation.

See bug #385834 comment #20.

To reclaim disk space and/or defragment your sqlite database, you want a full vacuum (but it can block the UI).
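
For what it's worth, a full vacuum is a single blocking statement; a minimal Python sqlite3 sketch (the path is an assumption), showing the kind of shrinkage reported above:

  import os
  import sqlite3

  DB = "urlclassifier3.sqlite"  # illustrative path

  before = os.path.getsize(DB)
  conn = sqlite3.connect(DB)
  conn.execute("VACUUM")  # rewrites the whole file, dropping free pages
  conn.close()
  print("shrank from %d to %d bytes" % (before, os.path.getsize(DB)))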
But the wasted space is a real problem for roaming profiles (under XP + Win2kX servers). In that case the space allowed per profile is 30 MB (it includes app data, desktop, etc.), and if 10 MB of that is just for Firefox... it's far too much.
On my network (2000 users on 800 machines) we had started to deploy Firefox, and we stopped: many profiles were over 30 MB because of urlclassifier2.sqlite!
Now, instead of deploying Firefox, I'm deploying a software restriction against Firefox (based on its hash) to block any Firefox, even on USB storage!

Sorry for my poor English; I'm French...
In Firefox 3, it's actually stored in Local Settings (and it's called urlclassifier3.sqlite), so it will not be accessed over the network in a roaming profile.  
So, we just have to wait a few months!
Summary: how to shrink the urlclassifier2.sqllite file ? → how to shrink the urlclassifier2.sqlite / urlclassifier3.sqlite file ?
Well, any progress on this bug?
The urlclassifier3.sqlite file is part of the roaming profile for me, not in Local Settings. Is there a setting which redirects this file?
The numbers in the original report no longer seem to be up to date; they have grown significantly, which makes the issue even more important.

Yesterday I found my urlclassifier3.sqlite at 51.8 MB (this was ~6 months after the original installation; I have never touched the file). I renamed it, and now, one day later, the recreated urlclassifier3.sqlite is 14.5 MB.

Firefox 3.0.6, Linux
Firefox 3.0.10, Win XP Pro

I just checked the two profiles I use and discovered they had urlclassifier2 & urlclassifier3.sqlite files totaling 46.6 MB and 37.7 MB respectively.

After deleting the pairs of files in both profiles, only a new urlclassifier3.sqlite gets recreated and so far it's 'only' about 8.75 MB in size -- note I did not disable my "Tell me if the site I'm visiting is a suspected..." security options in either profile, so it seems there's a lot of wasted disk space related to this.

Bottom line, this should be fixed and if the urlclassifier2.sqlite file is no longer needed, it should be deleted.
Perhaps we could set up a timer to run on idle every week to do a full vacuum on the db. That might keep it from growing too large.
FWIW for whoever is ready to take the risk of phishing and attack sites in order to save space:

http://ychittaranjan.wordpress.com/2008/06/27/urlclassifier3sqlite-woes-on-firefox-3/

Do this at your own risk!
Bug 512854 recently fixed vacuum on idle for places. That patch could probably be adapted for the urlclassifier DB as well.
The initial Places vacuum patch is possibly a first run at making vacuuming sqlite a browser-wide activity. CCing marco on this bug.
I would like to make a few additional points:
FYI: I am not a developer; however, some research shows:

1) You can shrink the SQLite files using sqlite3 (external install).  My urlclassifier3.sqlite went from 60 megs down to 20 megs, using the syntax:
sqlite3 urlclassifier3.sqlite VACUUM
For this to work, Firefox cannot be running, because the db file is locked.
2) Using the "auto_vacuum" pragma might work, but:

"As of version 3.4.0 (the first version that supports incremental_vacuum) this feature is still experimental. Possible future changes include enhancing incremental vacuum to do defragmentation and node repacking just as the full-blown VACUUM command does. And incremental vacuum may be promoted from a pragma to a separate SQL command, or perhaps some variation on the VACUUM command. Programmers are cautioned to not become enamored with the current syntax or functionality as it is likely to change."

http://www.sqlite.org/pragma.html#pragma_incremental_vacuum

3) This isn't a Windows-specific issue; it happens on Linux as well.  A temporary workaround is a batch file or shell script (see the sketch after this comment) that:

a) detects whether the Firefox process is running;
b) checks for the number of profiles;
c) if Firefox is not running, executes: sqlite3 urlclassifier3.sqlite VACUUM;
d) runs from the Windows scheduler, as a cron job, or manually from the command line.

PS: Running sqlite3 xxx VACUUM on all files takes something like 10-16 seconds on a 1 GHz PIII.
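
A minimal Python sketch of such a script, assuming it is acceptable to let SQLite's own locking abort the VACUUM instead of detecting the Firefox process (the profile glob and the timeout are assumptions):

  import glob
  import os
  import sqlite3

  # Hypothetical Linux profile location; adjust per OS.
  PATTERN = os.path.expanduser("~/.mozilla/firefox/*/urlclassifier3.sqlite")

  for db in glob.glob(PATTERN):
      try:
          conn = sqlite3.connect(db, timeout=1)
          conn.execute("VACUUM")  # raises "database is locked" if Firefox is running
          conn.close()
          print("vacuumed", db)
      except sqlite3.OperationalError as err:
          print("skipped %s: %s" % (db, err))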
(In reply to comment #17)
> 2) Using the "auto_vacuum" pragma might work, but:

Not just that: auto_vacuum is also known to create even more fragmentation of the database, making it slower. See above.

Yes, urlclassifier is a big db. We experimentally added vacuum for Places since it is also a big db, and so far I've not seen complaints. We should vacuum urlclassifier as well, at least once a month.
It can be done through async storage on idle, like we do for Places. Generalizing this into a separate component would be doable, but each sqlite consumer can use exclusive locking and provide different ways to access the database connection, so the component would not be generic enough.
Maybe we could have a generalized mozPIDatabase that services implement, providing their own connection, regardless of whether it is exclusive or not; at that point the vacuum component could just QI that interface to get the connection. Services themselves should register with the vacuum component if they want their DB vacuumed.
This could somehow work; I'm just unsure whether there would be an unacceptable overhead due to components having to register. Maybe we could use a categoryCache.
The service would register in the "wants-vacuum" category; when vacuum starts, the vacuum component would fire a "vacuum-start" notification with the contract ID of the service, and the service should ensure the database won't be written while vacuum runs, until it receives the "vacuum-end" notification.
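
To make that flow concrete, a rough sketch of the registration/notification protocol, in Python for brevity; the "wants-vacuum" / "vacuum-start" / "vacuum-end" names come from the comment above, and everything else (class and function names) is hypothetical:

  import sqlite3

  WANTS_VACUUM = []  # stand-in for the "wants-vacuum" category registry

  class UrlClassifierService:
      """A service exposing its own connection, as a mozPIDatabase would."""
      def __init__(self, path):
          self.connection = sqlite3.connect(path)
          self.suspend_writes = False
          WANTS_VACUUM.append(self)  # register for periodic vacuum

      def observe(self, topic):
          if topic == "vacuum-start":
              self.suspend_writes = True   # hold off writes while VACUUM runs
          elif topic == "vacuum-end":
              self.suspend_writes = False

  def vacuum_component_on_idle():
      # Fired from an idle timer, e.g. at most once a month per database.
      for service in WANTS_VACUUM:
          service.observe("vacuum-start")
          service.connection.execute("VACUUM")
          service.observe("vacuum-end")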
Filed bug 541373 against storage.
Depends on: 541373
The file isn't only large, it's VERY fragmented. You may analyze fragmentation using Defraggler; it will possibly be the most fragmented file on disk, consisting of thousands of fragments.
On a terminal server it turns into a big problem.
The simplest solution is to delete these files periodically.
Blocks: 572459
Summary: how to shrink the urlclassifier2.sqlite / urlclassifier3.sqlite file ? → register with VACUUM component how to shrink the urlclassifier.sqlite
Requesting blocking status. The URL classifier is a big file that adds a lot of I/O delay if it isn't VACUUMed regularly.
blocking2.0: --- → ?
Attached patch wip (obsolete) (deleted) — Splinter Review
Don't have enough of an XPCOM clue to know how to register this with the category manager. Perhaps someone can help me finish this.
Attached patch wip v2 (deleted) — Splinter Review
Here's the category registration and some slight changes to the first patch's use of nsCOMPtr members.
Attachment #496881 - Attachment is obsolete: true
I can review this when the patch is ready.
Will this patch defragment as well as shrink?
blocking2.0: ? → final+
Keywords: perf
Whiteboard: [i/o]
(In reply to comment #25)
> Will this patch defragment as well as shrink?

Vacuuming will create a completely new file. It is not defragmenting (which is an issue for the OS, actually); you might end up with more fragments than before. Or fewer. But it will be smaller at least.
(In reply to comment #26)
> (In reply to comment #25)
> > Will this patch defragment as well as shrink?
> 
> Vacuuming will create a completely new file. It is not defragmenting (which is
> an issue for the OS, actually); you might end up with more fragments than
> before. Or fewer. But it will be smaller at least.

My chunked growth keeps fragmentation at bay (for any new sqlite db growth). Old urlclassifier data will remain fragmented, but we can work on creative ways to fix that later.


(In reply to comment #26)
> (In reply to comment #25)
> > Will this patch defragment as well as shrink?
> 
> Vacuuming will create a completely new file. It is not defragmenting (which is
> an issue for the OS, actually); you might end up with more fragments than
> before. Or fewer. But it will be smaller at least.

Not quite. sqlite elects the worst-performing-idea-ever approach to vacuuming: it copies the db into a new file (yay), and then copies that data back into the main file (groan). A saner strategy would've been to hot-copy the data to a new file and then rename() + switch file descriptors, but due to silly design constraints on sqlite (i.e. multiprocess access), this isn't available.

So vacuum does fix internal fragmentation (and switches the db to the modern page size), but it keeps the existing filesystem-level fragmentation and performs at least 2x more I/O than is sane.
Alright, I didn't realize that the file was copied back.
(In reply to comment #27)
> Not quite. sqlite elects the worst-performing-idea-ever approach to vacuuming:
> it copies the db into a new file (yay), and then copies that data back into
> the main file (groan). A saner strategy would've been to hot-copy the data to
> a new file and then rename() + switch file descriptors, but due to silly
> design constraints on sqlite (i.e. multiprocess access), this isn't available.

Just because it doesn't match *our* use case doesn't mean it's silly. There are lots of consumers of SQLite (e.g., web server usage) that do this. In fact, mobile has even been talking about using SQLite in both the content and chrome process.
Not a regression - opportunistic wins don't block.
blocking2.0: final+ → -
Attached patch wip2(debug) (deleted) — Splinter Review
(In reply to comment #30)
> Not a regression - opportunistic wins don't block.
To be clear, we'd likely approve a patch.
What's the current status of this?
Nothing more happened, AFAIK. Landing bug 673470 will replace SQLite usage for this file with an optimized store.
Depends on: 673470
Bug 673470 landed, making this pointless. Bug 723153 tracks removing the old SQLite files.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
Product: Firefox → Toolkit