Closed Bug 416330 Opened 17 years ago Closed 14 years ago

Suboptimal SQLite page size

Categories

(Toolkit :: Storage, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: bugzilla, Assigned: taras.mozilla)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [ts])

Attachments

(2 files, 8 obsolete files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12 I noticed that all the sqlite dbs used by firefox have a PRAGMA page_size of 1024. As explained in http://www.sqlite.org/cvstrac/wiki?p=PerformanceTuningWindows though, the page size should be set equal to that of the underlying file system to maximize performance (my NTFS partition was created with a page size of 4K). When the databases are created, the file system page size should be probed and the PRAGMA should be set correctly. Reproducible: Always
Component: General → Places
QA Contact: general → places
this has been addressed in places some time ago, but still we can't change old places.sqlite with 1024 to the new pagesize... See Bug 401985 and Bug 402076. Actual created places.sqlite have page_size = 4096. it should be done for other dbs, not places, so i'm moving this into General. i ask myself if there's a way to make this change global, so that mozStorage could automatically set the page size based on the OS (so i'm cc-ing sdwilsh here that should know more about that)
Component: Places → General
QA Contact: places → general
I suspect that we'll want to do this with any and all connections. Moving to storage.
Status: UNCONFIRMED → NEW
Component: General → Storage
Ever confirmed: true
Product: Firefox → Toolkit
QA Contact: general → storage
Version: unspecified → Trunk
As of SQLite version 3.5.8, you can change the page size of a database using the following command sequence: PRAGMA page_size=4096; VACUUM; You can do this on the places.sqlite database (for example) using an SQLite CLI that you download from the SQLite website and it will change the page size on a live database - using a transaction to protect your data in the event of a power loss or crash. Firefox should continue to operate normally. In fact, you should be able to make this change while FF is running. Perhaps you can experiment with different page sizes and let us know what page sizes work better on various systems. Page sizes can be any power of two between 512 and 32768. Invalid page sizes (ex: 1000 or 8191) are silently ignored. Note that the VACUUM command can take about 1 seconds/megabyte to run (more or less, depending on your hardware). So if you have a really big places.sqlite database, please expect to wait a few seconds when you change the page size. SQLite has to rewrite the entire database file three times in order to pull this trick off safely. As of SQLite version 3.5.0, the "VFS" operating-system porting layer has methods that are designed to help SQLite pick an optimal page size for the underlying filesystem. But the default methods for both unix and win32 provide values that cause SQLite to choose a 1024 byte page every time. This is mostly because we do not know how to discover the necessary information on windows and we are reasonable sure there is no portable way to discover the needed parameters on windows. It anybody wants to try to work out better ways of picking a default page size, please contact me directly and I will be happy to advise you as to what needs to be done. If you can provide an algorithm, we (the SQLite developers) will implement it.
I'm not sure what you're looking for when you say "portable way to discover the needed parameters", but there does appear to be a way to get the cluster size in bytes for NTFS volumes using the DeviceIoControl - http://msdn.microsoft.com/en-us/library/aa364569(VS.85).aspx FAT32 seems more problematic, I haven't found a way yet to get that reliably through a single call. GetDiskFreeSpace apparently can do it, but only on partitions smaller than 2GB. Seeing as how NTFS is the default for newer windows variants, it might be worth while setting it correctly for NTFS going forward.
Is that a C API Jim? Also, can you determine what type of file system it is problematically?
FYI I opened a bug on sqlite.org some time ago: http://www.sqlite.org/cvstrac/tktview?tn=2931
>Is that a C API Jim? Also, can you determine what type of file system it is >problematically? Yes, it's a win32 api call. (not platform independent obviously, not sure if they were looking for something like this or a standard library call - which I didn't find.) I haven't searched for FAT32 detection calls, but my guess is this call would fail for anything other than NTFS partitions. Also, it's only available on 2K and up, so GetProcAddress should be used. The header info is in winbase.h.
Keywords: perf
Blocks: 572459
(In reply to comment #6) > FYI I opened a bug on sqlite.org some time ago: > http://www.sqlite.org/cvstrac/tktview?tn=2931 Note that http://www.sqlite.org/src/tktview/ba7fdb568d6c80b90dfdc66d44e8a6f29a9ebd0f seems to be more important now.
(In reply to comment #7) > Also, it's only available on 2K and up, so GetProcAddress should be used. The > header info is in winbase.h. We don't need to do this, because we don't support anything below Win2k.
Whiteboard: [ts]
This is *anything* but minor!
Severity: minor → normal
(In reply to comment #4) > FAT32 seems more problematic, I haven't found a way yet to get that reliably > through a single call. GetDiskFreeSpace apparently can do it, but only on > partitions smaller than 2GB. As far as I know, the only thing that might be returned incorrectly in volumes larger than 2GB with GetDiskFreeSpace is the total amount of free space. Bytes per sector should be reported correctly regardless.
The discussion here seems to be combining 2 distinct concepts. Every file-system has some block-size setting, optimizing for that yields the most compact layout on disk. On the other hand, one should never-ever read in those block sizes. They are designed for efficiently packing files only. One should always read some (large) multiple of that. For example I've spent a lot of time investigating how binaries get paged in from disk which is equivalent to doing random io queries in sqlite. There windows is hardcoded to 8K(for really tiny data mmap areas) and 32K(for huge code areas). Linux always reads in 128K chunks and that will be bumped up soon hopefully. For mmap. Linux and other sane Unices will bump the read size to something respectable (like 2MB) chunks when the userspace program indicates it is interested in a particular file chunk via madvise. If sqlite knows the io ranges it is about to read it should really consider doing fadvise on unix-like platforms by default. Storage likes bulky IO, so at least for Mozilla I'd suggest 32K queries as a minium and maybe 512K chunks for databases that do SELECT * and bulky writes. (originally misposted this in bug 541373)
Attached patch 32K test (obsolete) (deleted) — Splinter Review
Assignee: nobody → tglek
I tried above test on try server. Results are pretty interesting(rev 87e7b2736018). It appears that Ts_med/max_generated[_shutdown] are significantly improved(20-30%) on osx/windows and somewhat improved on linux(as expected). On the other hand Ts_min/rss/etc are slightly worse. Conclusion: system calls are expensive on some OSes, larger buffers cut those down a lot. I'm not completely sure as to why the other measures like rss are different, could be some try-server artifact. This also suggests that scaling page size alone isn't the right thing to do. Should scale read/write buffers with file size. Alternatively we can just switch to mmap io and skip the buffers altogether(let the os decide how much to page in, should work well on unix..might work ok on windows). I think in the short term we should bump up the page size from default 1K. 32K seems like it's a bit too ambitious, 8 or 16 might be a reasonable tradeoff.
(In reply to comment #14) > This also suggests that scaling page size alone isn't the right thing to do. > Should scale read/write buffers with file size. Alternatively we can just > switch to mmap io and skip the buffers altogether(let the os decide how much to > page in, should work well on unix..might work ok on windows). This sounds like a good idea. One of the reasons Thunderbird has an explicit 1k page size for our global message database is that we have long-running transactions and would hold our page cache size constant-ish regardless of page size. Larger page size => faster exhaustion of the page cache on transactions that mutate a lot of pages, faster spills to disk/writes with fsyncs, plus a larger transaction journal. (Because we use FTS3 and segment merges will always result in tremendous page-touching, even if we used more transactions, this would still be a concern.) I believe most firefox SQLite operations are unlikely to ever exceed the cache size (at least until you start using FTS3), so that's a different workload entirely. And I suspect we are all agreed it makes sense to generally optimize for Firefox. Taras, one sorta-aside question I have for you is whether you have been looking at the 'fragmentation' of places (or other firefox db's) in terms of how much larger linear reads (potentially) reduce the need for future disk seeks. Obviously, when there is no memory pressure, causing the OS to cache large segments of the file without additional seek latency is a serious win. But if only one page per megabyte is ever 'hot' and there is not sufficient memory to cache the whole file, reading less could theoretically be a win. I expect that if one logged page requests inside SQLite (and therefore ignoring page cache effects and actual disk requests), one could attempt to simulate the effects of different page cache sizes and benefits of larger (linear) disk reads.
Gonna postphone any landings until b2. Need to determine ideal constant size + vacuum strategy. (In reply to comment #15) > (In reply to comment #14) > > This also suggests that scaling page size alone isn't the right thing to do. > > Should scale read/write buffers with file size. Alternatively we can just > > switch to mmap io and skip the buffers altogether(let the os decide how much to > > page in, should work well on unix..might work ok on windows). > > This sounds like a good idea. One of the reasons Thunderbird has an explicit > 1k page size for our global message database is that we have long-running > transactions and would hold our page cache size constant-ish regardless of page > size. Larger page size => faster exhaustion of the page cache on transactions > that mutate a lot of pages, faster spills to disk/writes with fsyncs, plus a > larger transaction journal. (Because we use FTS3 and segment merges will > always result in tremendous page-touching, even if we used more transactions, > this would still be a concern.) Yup. Large pages do sound like a bad idea there. Sounds like mmap may be the only way out for tb. > > I believe most firefox SQLite operations are unlikely to ever exceed the cache > size (at least until you start using FTS3), so that's a different workload > entirely. And I suspect we are all agreed it makes sense to generally optimize > for Firefox. What's fts3? > > Taras, one sorta-aside question I have for you is whether you have been looking > at the 'fragmentation' of places (or other firefox db's) in terms of how much > larger linear reads (potentially) reduce the need for future disk seeks. > Obviously, when there is no memory pressure, causing the OS to cache large > segments of the file without additional seek latency is a serious win. But if > only one page per megabyte is ever 'hot' and there is not sufficient memory to > cache the whole file, reading less could theoretically be a win. I expect that > if one logged page requests inside SQLite (and therefore ignoring page cache > effects and actual disk requests), one could attempt to simulate the effects of > different page cache sizes and benefits of larger (linear) disk reads. From the sqlite graphs I saw the io is somewhat linear. It ranges from perfectly linear select * on a vacuumed db to a nasty seek-frenzy in a extremely non-vacuumed case. Thus bug 572460. It looks like the main cost of small pages is syscall overhead. Linear-ish io means the OS cache has a chance to do right thing(tm). It just doesn't work as well when you do 16k iops instead of <1k...ie on large places dbs.
(In reply to comment #16) > What's fts3? Full-text search support: http://www.sqlite.org/fts3.html A fairly intuitive and efficient inverted-index implementation. The 'segments' (terms, docs, offsets) are immutable and carry generation numbers; when you get enough of a certain generation, they all get merged and the generation gets bumped. Deletions are tracked as negative evidence.
Attached patch swap everything to 32K (obsolete) (deleted) — Splinter Review
I no longer trust try-server numbers for this work. Cold io numbers are too noisy. I experimented with sqlite using an sqlite hello-world type program to measure io how much io is performed. For the generated worst-case dbs(4k vs 32k) slow places queries are up to 2x faster. Vacuums are 20% faster. Other things I learned: I tried implementing sqlite-vfs unixRead function via mmap. That performs slightly better in some cases and significantly worse in others. I suspect it's because even if sqlite-io is nearly linear there are usually significant gaps between reads, so mmap-readahead causes too much data to be readin. For the same reason madvise/fadvise do more harm than good. In theory one can enumerate the pages that are going to be read in and inform the os via fadvise. Unfortunately, current sqlite api makes it tricky get the exact file offsets ahead of time. Even then, discontinuous io might foil this approach too.
Attachment #452032 - Attachment is obsolete: true
Comment on attachment 453583 [details] [diff] [review] swap everything to 32K asuth, According to your comment this patch might make thunderbird unhappy. If that's the case, please suggest a better way. Cranking up page-size the only straight-forward way to speed up sqlite at the moment.
Attachment #453583 - Flags: review?(bugmail)
Comment on attachment 453583 [details] [diff] [review] swap everything to 32K Thunderbird's gloda layer explicitly sets the page size to 1k and so won't feel any effects. Changing review to sdwilsh since he should weigh in on this and I definitely can't review places code. My one comment is a nit which is you have a typo in "preffered" and "incase" could use a space while you are there. I'm about to revisit a lot of gloda's db stuff, so I will be sure to leverage your findings about the potential serious wins for larger page sizes. Since FTS3 queries are our worst-case scenario and our worst-case FTS3 queries should involve large BLOBs, we could likely see better than 2x speed-ups. (The opposing factor is we're going to start sharding to reduce the worst-case situations.) Thanks for all your investigative work on this!
Attachment #453583 - Flags: review?(bugmail) → review?(sdwilsh)
(In reply to comment #18) > I experimented with sqlite using an sqlite hello-world type program to measure > io how much io is performed. For the generated worst-case dbs(4k vs 32k) slow > places queries are up to 2x faster. Vacuums are 20% faster. Can you elaborate on this? Was this via systemtap/dtrace, checking system/process counters before/after, custom sqlite VFS? (I'm interested for automated performance metrics gathering.)
Attached file sqlite vfs hack for experimentation (deleted) —
(In reply to comment #21) > (In reply to comment #18) > > I experimented with sqlite using an sqlite hello-world type program to measure > > io how much io is performed. For the generated worst-case dbs(4k vs 32k) slow > > places queries are up to 2x faster. Vacuums are 20% faster. > > Can you elaborate on this? Was this via systemtap/dtrace, checking > system/process counters before/after, custom sqlite VFS? (I'm interested for > automated performance metrics gathering.) So here is my testcase. I passed a query that triggered lots of io and measured the difference with "time" program. select sum(id),sum(position),sum(parent) from moz_bookmarks The most surprisingly interesting bit in the script is the "RESET" printf, it points out backwards io in sqlite which is most likely in files with small page sizes and lots of fragmentation. I was surprised by how fragmented sqlite files get...ie the massive reduction in io resulting from vacuuming. It would be really nice if sqlite had a way to autovacuum self. I found I didn't need perf-counters/systemtap in the end because differences in perf were so large. My main gripe with systemtap/etc here is that there is no good cross-filesystem way to tell when a read() call is serviced from cache and much readahead it causes. mmap io is so much easier for that :(
(In reply to comment #22) > The most surprisingly interesting bit in the script is the "RESET" printf, it > points out backwards io in sqlite which is most likely in files with small page > sizes and lots of fragmentation. I was surprised by how fragmented sqlite files > get...ie the massive reduction in io resulting from vacuuming. It would be > really nice if sqlite had a way to autovacuum self. It does, it just tends to cause greater fragmentation over time. We used to use it, and we stopped.
Comment on attachment 453583 [details] [diff] [review] swap everything to 32K For more detailed review comments with context, please see http://reviews.visophyte.org/r/453583/. This needs a test that the page_size is set to what we expect it to be set to after open[Unshared]Database. You likely want to add these two tests to test_storage_connection.js. on file: db/sqlite3/src/Makefile.in line 133 > DEFINES += -DSQLITE_DEFAULT_PAGE_SIZE=32768 You should place this up with all the other SQLite defines and add a comment explaining why we do it like we do for the rest. on file: storage/src/mozStorageConnection.cpp line 396 > // Switch db to preffered page size incase the user vacuums nit: asuth's spelling comment and use a period at the end of your sentence please. on file: storage/src/mozStorageConnection.cpp line 398 > sqlite3_stmt *stmt; > srv = ::sqlite3_prepare_v2(mDBConn, "PRAGMA page_size" , -1, &stmt, NULL); > if (srv == SQLITE_OK) { > ::sqlite3_step(stmt); > int res = sqlite3_column_int(stmt, 0); > if (res < 32768) { > (void)::sqlite3_finalize(stmt); > srv = ::sqlite3_prepare_v2(mDBConn, "PRAGMA page_size=32768" , -1, &stmt, NULL); > if (srv == SQLITE_OK) { > (void)::sqlite3_step(stmt); > } > } > } > (void)::sqlite3_finalize(stmt); Couple things: 1) You need to use prepareStmt and not the raw SQLite functions here like the rest of this method. 2) Is there any reason why we don't just blindly set it? Seems like executing two queries sometimes is going to be more expensive than one query all the time. on file: toolkit/components/places/src/nsNavHistory.cpp line 634 > bool databaseInitialized = (currentSchemaVersion > 0); move this down to the location of first use please on file: toolkit/components/places/src/nsNavHistory.cpp line 642 > nit: you added trailing whitespace here on file: toolkit/components/url-classifier/src/nsUrlClassifierDBService.cpp line 3167 > rv = mGetPageSizeStatement->ExecuteStep(&hasResult); You need to use a mozStorageStatementScoper on this otherwise subsequent executions will fail. on file: toolkit/components/url-classifier/src/nsUrlClassifierDBService.cpp line 3170 > if (hasResult) { You should assert that this is always true, but it shouldn't ever be false I'm also not technically able to review the urlclassifier stuff, however you can probably get an exemption from Mossop (toolkit owner) saying I can for this patch. Have him comment in this bug accordingly please.
Attachment #453583 - Flags: review?(sdwilsh) → review-
Attached patch 32k r2 (obsolete) (deleted) — Splinter Review
Attachment #453583 - Attachment is obsolete: true
Attachment #453865 - Flags: review?(sdwilsh)
(In reply to comment #24) > I'm also not technically able to review the urlclassifier stuff, however you > can probably get an exemption from Mossop (toolkit owner) saying I can for this > patch. Have him comment in this bug accordingly please. It is all storage related changes so I trust you if you say you are ok to review this.
Comment on attachment 453865 [details] [diff] [review] 32k r2 >+++ b/storage/src/mozStorageConnection.cpp >+ // Switch db to preferred page size in case the user vacuums. >+ sqlite3_stmt *stmt; >+ srv = prepareStmt(mDBConn, NS_LITERAL_CSTRING("PRAGMA page_size=32768"), >+ &stmt); nit: spaces around '=' >+ if (srv == SQLITE_OK) { >+ ::sqlite3_step(stmt); nit: (void) return result, and you need to use stepStmt here. >+++ b/storage/test/unit/test_32k.js nit: more descriptive filename please (something like test_page_size_is_32k.js) nit: use creative commons public domain license per http://www.mozilla.org/MPL/license-policy.html >+function run_test() >+{ >+ var stmt = getDatabase(do_get_file("new_32k_db.sqlite")).createStatement("PRAGMA page_size"); >+ stmt.executeStep(); >+ do_check_eq(stmt.getInt32(0), 32768); make this value a const please and then we need to test unshared connections too (getDatabase just does openDatabase. We need to test openUnsharedDatabase too). >+++ b/toolkit/components/places/src/nsNavHistory.cpp >+ bool databaseInitialized = (currentSchemaVersion > 0); >+ > if (!databaseInitialized) { nit: lose the extra newline here. r=sdwilsh with comments addressed
Attachment #453865 - Flags: review?(sdwilsh) → review+
Attached patch 32k addressed nits (obsolete) (deleted) — Splinter Review
carrying over Shawn's review
Attachment #453865 - Attachment is obsolete: true
Attachment #453880 - Flags: review+
Attached patch 32k final (obsolete) (deleted) — Splinter Review
Touched up the testcase, this is ready for checkin.
Attachment #453880 - Attachment is obsolete: true
Attachment #453883 - Flags: review+
Keywords: checkin-needed
Attached patch 32k final (obsolete) (deleted) — Splinter Review
Forgot to qref. Final patch for real.
Attachment #453883 - Attachment is obsolete: true
Attachment #453884 - Flags: review+
Attachment #453809 - Attachment description: sqlite testcase → sqlite vfs hack for experimentation
Status: NEW → RESOLVED
Closed: 14 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Backed out b/c of Assertion failed: (SQLITE_DEFAULT_PAGE_SIZE<=SQLITE_MAX_DEFAULT_PAGE_SIZE), function sqlite3PagerOpen, file /builds/slave/mozilla-central-macosx64-debug/build/storage/../db/sqlite3/src/sqlite3.c, line 35430. TEST-UNEXPECTED-FAIL | automation.py | Exited with code -6 during test run
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #32) > Backed out b/c of > > Assertion failed: (SQLITE_DEFAULT_PAGE_SIZE<=SQLITE_MAX_DEFAULT_PAGE_SIZE), > function sqlite3PagerOpen, file > /builds/slave/mozilla-central-macosx64-debug/build/storage/../db/sqlite3/src/sqlite3.c, > line 35430. > TEST-UNEXPECTED-FAIL | automation.py | Exited with code -6 during test run That's....weird? http://mxr.mozilla.org/mozilla-central/source/db/sqlite3/src/sqlite3.c#237 means that shouldn't fail...
244 #ifndef SQLITE_MAX_DEFAULT_PAGE_SIZE 245 # define SQLITE_MAX_DEFAULT_PAGE_SIZE 8192 in our case is 8192 since we don't define a new value 247 #if SQLITE_MAX_DEFAULT_PAGE_SIZE>SQLITE_MAX_PAGE_SIZE if (8192 > 32768) { 248 # undef SQLITE_MAX_DEFAULT_PAGE_SIZE 249 # define SQLITE_MAX_DEFAULT_PAGE_SIZE SQLITE_MAX_PAGE_SIZE } so SQLITE_MAX_DEFAULT_PAGE_SIZE is still 8192 thus assert(SQLITE_DEFAULT_PAGE_SIZE<=SQLITE_MAX_DEFAULT_PAGE_SIZE) is assert(32768<=8192)
Attached patch 32k debug fix (obsolete) (deleted) — Splinter Review
Attachment #453884 - Attachment is obsolete: true
Attachment #454100 - Flags: review?(sdwilsh)
Comment on attachment 454100 [details] [diff] [review] 32k debug fix >+++ b/db/sqlite3/src/Makefile.in >+# -DSQLITE_DEFAULT_PAGE_SIZE=32768 and SQLITE_MAX_DEFAULT_PAGE_SIZE=32768 >+# increases the page size from 1k, see bug 416330. nit: drop the reference to the bug, and say "...from 1k to 32k." r=sdwilsh
Attachment #454100 - Flags: review?(sdwilsh) → review+
Status: REOPENED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
Attached patch fix (obsolete) (deleted) — Splinter Review
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch final(passes try server) (deleted) — Splinter Review
Was too rushed to run through try server last time, this patch fixes the typo in the "fix" patch.
Attachment #454100 - Attachment is obsolete: true
Attachment #454669 - Attachment is obsolete: true
Attachment #454936 - Flags: review+
Keywords: checkin-needed
Depends on: 575788
Status: REOPENED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
Has this been tested with system sqlite ?
(In reply to comment #43) > Has this been tested with system sqlite ? No, but I don't see why it wouldn't work. Only difference is that with system sqlite you get a small page size to start with, but after the first vacuum you'll switch to larger pages.
This appears to have caused a trace-malloc leak regression (at least on Windows): http://graphs.mozilla.org/graph.html#tests=[[28,1,107]]&sel=1276298155,1279306931
It also appears to have caused a trace-malloc MaxHeap regression across platforms.
(In reply to comment #45) > This appears to have caused a trace-malloc leak regression (at least on > Windows): > > http://graphs.mozilla.org/graph.html#tests=[[28,1,107]]&sel=1276298155,1279306931 (In reply to comment #46) > It also appears to have caused a trace-malloc MaxHeap regression across > platforms. Both of these should be expected, as bz said in dev-tree-management
Blocks: 572559
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: