628921 - Changing from WAL to a rollback journal to change page_size may cause a infinite loop @AsyncExecuteStatements::executeStatement

Reporter

Description

•

14 years ago

So far, I've seen this twice in the past few days. This is with a build using system sqlite version 3.7.4. This doesn't prevent firefox from running fine, but it makes it suck 100% CPU. I'm filing this bug because I'm unsure if it's a sqlite issue or a firefox issue, and I'm looking for guidance as to where to look at. FWIW, I got a crash while tracking what was happening the second time under gdb, though I think it might have been because I only attached to one thread. The stack trace when the cpu sucks 100% looks like this: #0 0x00007f094865d168 in do_fcntl (fd=<value optimized out>, cmd=<value optimized out>, arg=0x7f0918dfe760) at ../sysdeps/unix/sysv/linux/fcntl.c:40 #1 0x00007f094865d2c8 in __libc_fcntl (fd=37, cmd=6) at ../sysdeps/unix/sysv/linux/fcntl.c:89 #2 0x00007f09454d2976 in _posixUnlock (id=0x7f092c186150, eFileLock=1) at sqlite3.c:24316 #3 unixUnlock (id=0x7f092c186150, eFileLock=1) at sqlite3.c:24381 #4 0x00007f0945496bf4 in sqlite3OsUnlock (pPager=0x7f092c186008, eLock=1) at sqlite3.c:13784 #5 pagerUnlockDb (pPager=0x7f092c186008, eLock=1) at sqlite3.c:35765 #6 0x00007f0945496f4e in pagerExclusiveLock (pPager=0x7f092c186008) at sqlite3.c:41293 #7 0x00007f0945507230 in sqlite3PagerCloseWal (p=<value optimized out>) at sqlite3.c:41413 #8 sqlite3VdbeExec (p=<value optimized out>) at sqlite3.c:138 #9 0x00007f09454e8820 in sqlite3Step (pStmt=0x7f08e1296708) at sqlite3.c:58706 #10 sqlite3_step (pStmt=0x7f08e1296708) at sqlite3.c:58770 #11 0x00007f09465348a3 in mozilla::storage::stepStmt (aStatement=0x7f08e1296708) at ../../../storage/src/mozStoragePrivateHelpers.cpp:283 #12 0x00007f094653369b in mozilla::storage::AsyncExecuteStatements::executeStatement (this=0x7f08df6444a0, aStatement=0x7f08e1296708) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:345 #13 0x00007f094653378a in mozilla::storage::AsyncExecuteStatements::executeAndProcessStatement (this=0x7f08df6444a0, aStatement=0x7f08e1296708, aLastStatement=true) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:292 #14 0x00007f0946533b18 in mozilla::storage::AsyncExecuteStatements::Run (this=0x7f08df6444a0) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:601 #15 0x00007f094672c674 in nsThread::ProcessNextEvent (this=0x7f091fe90b30, mayWait=1, result=0x7f0918dfee5c) at ../../../xpcom/threads/nsThread.cpp:633 #16 0x00007f09466f9cb9 in NS_ProcessNextEvent_P (thread=0x25, mayWait=6) at nsThreadUtils.cpp:250 #17 0x00007f094672cd53 in nsThread::ThreadFunc (arg=<value optimized out>) at ../../../xpcom/threads/nsThread.cpp:278 #18 0x00007f094719f7c3 in _pt_root (arg=<value optimized out>) at ptthread.c:228 #19 0x00007f09486558ba in start_thread (arg=<value optimized out>) at pthread_create.c:300 #20 0x00007f09483bd02d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #21 0x0000000000000000 in ?? () And I can go up to mozilla::storage::AsyncExecuteStatements::executeStatement with gdb's finish command.

Mike Hommey [:glandium]

Reporter

Updated

•

14 years ago

Status: NEW → UNCONFIRMED

Ever confirmed: false

Shawn Wilsher :sdwilsh

Comment 1

•

14 years ago

Nothing really jumps out at me offhand. If you hit it again, let me know?

Mike Hommey [:glandium]

Reporter

Comment 2

•

14 years ago

(In reply to comment #1) > Nothing really jumps out at me offhand. If you hit it again, let me know? Happening right now

Mike Hommey [:glandium]

Reporter

Comment 3

•

14 years ago

sqlite3_step is returning SQLITE_BUSY, which explains the infinite loop. It doesn't explain what is keeping the database busy, though.

Mike Hommey [:glandium]

Reporter

Comment 4

•

14 years ago

Sadly, after a while, it manages to break the loop, which doesn't help, since I was hoping to keep the process doing its loop until i can get some debugging instructions :-/

Shawn Wilsher :sdwilsh

Comment 5

•

14 years ago

(In reply to comment #4) > Sadly, after a while, it manages to break the loop, which doesn't help, since I > was hoping to keep the process doing its loop until i can get some debugging > instructions :-/ I guess figuring out which database file this is is the way to get more information. We'd have to poke at the nsIFile on the Connection object to figure that out, it looks like.

Mike Hommey [:glandium]

Reporter

Comment 6

•

14 years ago

Got it again with b11, and it looks like it's an access to places: 0x00007f3246b53168 in do_fcntl (fd=<value optimized out>, cmd=<value optimized out>, arg=0x7f3221bfd750) at ../sysdeps/unix/sysv/linux/fcntl.c:40 40 ../sysdeps/unix/sysv/linux/fcntl.c: No such file or directory. in ../sysdeps/unix/sysv/linux/fcntl.c (gdb) bt #0 0x00007f3246b53168 in do_fcntl (fd=<value optimized out>, cmd=<value optimized out>, arg=0x7f3221bfd750) at ../sysdeps/unix/sysv/linux/fcntl.c:40 #1 0x00007f3246b532c8 in __libc_fcntl (fd=36, cmd=6) at ../sysdeps/unix/sysv/linux/fcntl.c:89 #2 0x00007f32439c2200 in unixLock (id=0x7f3226979d50, eFileLock=4) at sqlite3.c:24063 #3 0x00007f3243988c54 in sqlite3OsLock (pPager=0x7f3226979c08, eLock=4) at sqlite3.c:13781 #4 pagerLockDb (pPager=0x7f3226979c08, eLock=4) at sqlite3.c:35789 #5 0x00007f3243988f3b in pagerExclusiveLock (pPager=0x24) at sqlite3.c:41289 #6 0x00007f32439f9230 in sqlite3PagerCloseWal (p=<value optimized out>) at sqlite3.c:41413 #7 sqlite3VdbeExec (p=<value optimized out>) at sqlite3.c:138 #8 0x00007f32439da820 in sqlite3Step (pStmt=0x7f31e8f9fa88) at sqlite3.c:58706 #9 sqlite3_step (pStmt=0x7f31e8f9fa88) at sqlite3.c:58770 #10 0x00007f3244a2c267 in mozilla::storage::stepStmt (aStatement=0x7f31e8f9fa88) at ../../../storage/src/mozStoragePrivateHelpers.cpp:283 #11 0x00007f3244a2b05f in mozilla::storage::AsyncExecuteStatements::executeStatement (this=0x7f31f4695cc0, aStatement=0x7f31e8f9fa88) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:345 #12 0x00007f3244a2b14e in mozilla::storage::AsyncExecuteStatements::executeAndProcessStatement (this=0x7f31f4695cc0, aStatement=0x7f31e8f9fa88, aLastStatement=true) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:292 #13 0x00007f3244a2b4dc in mozilla::storage::AsyncExecuteStatements::Run (this=0x7f31f4695cc0) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:601 #14 0x00007f3244c2688c in nsThread::ProcessNextEvent (this=0x7f32226370e0, mayWait=1, result=0x7f3221bfde5c) at ../../../xpcom/threads/nsThread.cpp:633 #15 0x00007f3244bf3ed1 in NS_ProcessNextEvent_P (thread=0x24, mayWait=6) at nsThreadUtils.cpp:250 #16 0x00007f3244c26f6b in nsThread::ThreadFunc (arg=<value optimized out>) at ../../../xpcom/threads/nsThread.cpp:278 #17 0x00007f324569c7c3 in _pt_root (arg=<value optimized out>) at ptthread.c:228 #18 0x00007f3246b4b8ba in start_thread (arg=<value optimized out>) at pthread_create.c:300 #19 0x00007f32468b302d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #20 0x0000000000000000 in ?? () (gdb) frame 11 #11 0x00007f3244a2b05f in mozilla::storage::AsyncExecuteStatements::executeStatement (this=0x7f31f4695cc0, aStatement=0x7f31e8f9fa88) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:345 345 ../../../storage/src/mozStorageAsyncStatementExecution.cpp: No such file or directory. in ../../../storage/src/mozStorageAsyncStatementExecution.cpp (gdb) print mConnection $1 = {mRawPtr = 0x7f3226ba9c00} (gdb) print *((Connection *)0x7f3226ba9c00) $2 = {<mozIStorageConnection> = {<nsISupports> = { _vptr.nsISupports = 0x7f3245578d80}, <No data fields>}, <nsIInterfaceRequestor> = {<nsISupports> = { _vptr.nsISupports = 0x7f3245578e90}, <No data fields>}, mRefCnt = {mValue = 111}, sharedAsyncExecutionMutex = {<mozilla::BlockingResourceBase> = {static kResourceTypeName = 0x7f32454247f0}, mLock = 0x7f3226b36df0}, sharedDBMutex = {<mozilla::BlockingResourceBase> = { static kResourceTypeName = <same as static member of an already seen type>}, mMutex = 0x7f322a3cf1f8}, threadOpenedOn = {<nsCOMPtr_base> = {mRawPtr = 0x7f3234d46270}, <No data fields>}, mDBConn = 0x7f3226979808, mDatabaseFile = {<nsCOMPtr_base> = {mRawPtr = 0x7f3226c44b00}, <No data fields>}, mAsyncExecutionThread = {<nsCOMPtr_base> = {mRawPtr = 0x7f32226370e0}, <No data fields>}, mAsyncExecutionThreadShuttingDown = false, mTransactionInProgress = 0, mFunctions = {<nsBaseHashtable<nsCStringHashKey, mozilla::storage::Connection::FunctionInfo, mozilla::storage::Connection::FunctionInfo>> = {<nsTHashtable<nsBaseHashtableET<nsCStringHashKey, mozilla::storage::Connection::FunctionInfo> >> = { mTable = {ops = 0x7f324562fa10, data = 0x0, hashShift = 28, maxAlphaFrac = 192 '\300', minAlphaFrac = 64 '@', entrySize = 40, entryCount = 4, removedCount = 0, generation = 0, entryStore = 0x7f3226979400 ""}}, <No data fields>}, <No data fields>}, mProgressHandler = {<nsCOMPtr_base> = { mRawPtr = 0x0}, <No data fields>}, mFlags = 262150, mStorageService = {mRawPtr = 0x7f322de4a140}} (gdb) print ((Connection *)0x7f3226ba9c00)->getFilename() $3 = {<nsACString_internal> = {mData = 0x7f31ebd7a788 "places.sqlite", mLength = 13, mFlags = 5}, <No data fields>}

Shawn Wilsher :sdwilsh

Comment 7

•

14 years ago

Mike, I'm going to rope you in on an e-mail thread with the SQLite developers on this to debug this. While storage could handle the case of getting SQLITE_BUSY all the time, I chatted with drh today and determined that we shouldn't be getting SQLITE_BUSY.

sommerfeld

Comment 8

•

14 years ago

This looks very similar to a loop I'm seeing on solaris (sqlite3_step returning SQLITE_BUSY, because someone else has a PENDING_LOCK on the database). see bug 633771 which has some information on what I found so far inside sqlite3.c

Dan Kennedy

Comment 9

•

14 years ago

The "sqlite3PagerCloseWal" frame indicates that SQLite is trying to switch a database from WAL mode to rollback mode. The SQL statement executed was probably "PRAGMA journal_mode = DELETE" (or PERSIST, or anything other than WAL). Switching from WAL to rollback mode is much harder than from rollback to WAL. To do so successfully, SQLite requires that all other connections close the database. Not just unlock, but actually close (i.e. call sqlite3_close() or execute a DETACH statement). Otherwise it will return SQLITE_BUSY.

Mike Hommey [:glandium]

Reporter

Comment 10

•

14 years ago

What would I need to poke to know the SQL statement when it happens again?

Dan Kennedy

Comment 11

•

14 years ago

In the debugger? In frame sqlite3VdbeExec(), look at the string p->zSql. Or in sqlite3Step() - ((Vdbe *)pStmt)->zSql. Above that, call sqlite3_sql() on the statement handle.

Shawn Wilsher :sdwilsh

Comment 12

•

14 years ago

(In reply to comment #9) > The "sqlite3PagerCloseWal" frame indicates that SQLite is trying to switch a > database from WAL mode to rollback mode. The SQL statement executed was > probably "PRAGMA journal_mode = DELETE" (or PERSIST, or anything other than > WAL). We should only be doing that if we fail to set WAL mode (if this is places.sqlite).

Marco Bonardo [:mak]

Assignee

Comment 13

•

14 years ago

(In reply to comment #12) > We should only be doing that if we fail to set WAL mode (if this is > places.sqlite). We also try to do that in the vacuum manager if page size is not the expected one, WAL does not allow to change page size on a VACUUM. Checking if Glandium's places.sqlite has a wrong page size is easy, could you just run a PRAGMA page_size; on it please? This should only happen once every 30 days though, after a 5 minutes idle, unless the two above stack have been taken with 2 different profiles or the process was interrupted (crash, hang, kill). The "failing to set wal" case would happen really early on startup when no statement exists yet.

Marco Bonardo [:mak]

Assignee

Comment 14

•

14 years ago

And based on comment 9, we are hosed because we have cloned connections and we don't close them. So this makes fixing page size pretty much hard if not impossible.

Dan Kennedy

Comment 15

•

14 years ago

(In reply to comment #12) > (In reply to comment #9) > > The "sqlite3PagerCloseWal" frame indicates that SQLite is trying to switch a > > database from WAL mode to rollback mode. The SQL statement executed was > > probably "PRAGMA journal_mode = DELETE" (or PERSIST, or anything other than > > WAL). > We should only be doing that if we fail to set WAL mode (if this is > places.sqlite). I think if you actually fail to switch into WAL mode, the subsequent "PRAGMA journal_mode = [DELETE|PERSIST|TRUNCATE]" should not need to take the exclusive lock.

Shawn Wilsher :sdwilsh

Comment 16

•

14 years ago

(In reply to comment #13) > We also try to do that in the vacuum manager if page size is not the expected > one, WAL does not allow to change page size on a VACUUM. Oof, I forgot about that. :(

Mike Hommey [:glandium]

Reporter

Comment 17

•

14 years ago

(In reply to comment #13) > (In reply to comment #12) > > We should only be doing that if we fail to set WAL mode (if this is > > places.sqlite). > > We also try to do that in the vacuum manager if page size is not the expected > one, WAL does not allow to change page size on a VACUUM. > Checking if Glandium's places.sqlite has a wrong page size is easy, could you > just run a PRAGMA page_size; on it please? sqlite> PRAGMA page_size; 1024 > This should only happen once every 30 days though, after a 5 minutes idle, > unless the two above stack have been taken with 2 different profiles or the > process was interrupted (crash, hang, kill). This happened at least 5 times so far, within the last 2 weeks, with the same profile.

Marco Bonardo [:mak]

Assignee

Comment 18

•

14 years ago

So, imo this is absolutely our try to fix the page size. could you please check what's the value of the about:config option storage.vacuum.last.places.sqlite?

Mike Hommey [:glandium]

Reporter

Comment 19

•

14 years ago

(In reply to comment #18) > So, imo this is absolutely our try to fix the page size. > could you please check what's the value of the about:config option > storage.vacuum.last.places.sqlite? 1297709999

Marco Bonardo [:mak]

Assignee

Comment 20

•

14 years ago

(In reply to comment #19) > (In reply to comment #18) > > So, imo this is absolutely our try to fix the page size. > > could you please check what's the value of the about:config option > > storage.vacuum.last.places.sqlite? > > 1297709999 this is yesterday. Now I see what happens, we usually vacuum every 30 days, but if we notice the page size is not the expected one we proceed regardless. So for you this could happen every day... Based on this I'm going to confirm the bug. Shawn, at this point, unless we find a good alternative for fixing page size, we could want to do a simple one-line patch to disable page size correction... The only alternative I see is to delay the vacuum to xpcom-shutdown, when the connections should be closed, but without any visible window it could hurt the user (would happen only once though).

Status: UNCONFIRMED → NEW

Ever confirmed: true

Mike Hommey [:glandium]

Reporter

Comment 21

•

14 years ago

My profile is very old, so I guess a lot of users using an old profile like me may have the same kind of issues in the future. Couldn't the vacuum take place during profile upgrade ?

Marco Bonardo [:mak]

Assignee

Comment 22

•

14 years ago

The problem is not the vacuum, the problem is that page size can be changed only by a vacuum, but not if the journal mode is wal. I'm not sure if SQLite team has any clue if this will be possible in a future version. It's possible we could change so that in this specific case we run a vacuum on profile upgrade, but doesn't look something feasible in the 4.0 timeframe. That's why I think the only viable solution today, is to disable page size change on wal databases till next version.

Shawn Wilsher :sdwilsh

Comment 23

•

14 years ago

OK, so this turns out to be pretty bad. For users that have an the default page size of 1024, we have a high chance of hitting this during the 3.6 -> 4.0 upgrade. 4.0 beta users are not seeing this because our vacuuming logic landed before Places started to use WAL. mak or I can own this bug, and we'll work that out shortly...

blocking2.0: --- → final+

Whiteboard: [hardblocker]

Johnathan Nightingale [:johnath]

Comment 24

•

14 years ago

(In reply to comment #23) > mak or I can own this bug, and we'll work that out shortly... By all means, but marking it yours for the moment.

Assignee: nobody → sdwilsh

Shawn Wilsher :sdwilsh

Comment 25

•

14 years ago

Talked this over with Marco. He's going to take this, and we are going to go with the safest approach here. The patch is going to just not try to change the page size when the database is in WAL mode. In the future, we'll try to figure out a way around this, but for the time being, we'll just have to deal with users having a small page size.

Assignee: sdwilsh → mak77

Status: NEW → ASSIGNED

Marco Bonardo [:mak]

Assignee

Updated

•

14 years ago

Summary: Infinite loop @AsyncExecuteStatements::executeStatement → Changing from WAL to a rollback journal to change page_size may cause a infinite loop @AsyncExecuteStatements::executeStatement

Marco Bonardo [:mak]

Assignee

Updated

•

14 years ago

Blocks: 634374

Marco Bonardo [:mak]

Assignee

Comment 26

•

14 years ago

Attached patch patch v1.0 (deleted) — Details — Splinter Review

I'm sorry we have to rollback on this perf win, but after all since it was not working we are not regressing anything. We can properly fix this for the next version.

Attachment #512649 - Flags: review?(sdwilsh)

Shawn Wilsher :sdwilsh

Comment 27

•

14 years ago

Comment on attachment 512649 [details] [diff] [review] patch v1.0 r=sdwilsh

Attachment #512649 - Flags: review?(sdwilsh) → review+

Shawn Wilsher :sdwilsh

Updated

•

14 years ago

Whiteboard: [hardblocker] → [hardblocker][has patch][can land]

Shawn Wilsher :sdwilsh

Updated

•

14 years ago

Flags: in-testsuite-

Target Milestone: --- → mozilla2.0b12

Shawn Wilsher :sdwilsh

Updated

•

14 years ago

Depends on: 541373

Marco Bonardo [:mak]

Assignee

Comment 28

•

14 years ago

http://hg.mozilla.org/mozilla-central/rev/49186a8f4fa2

Status: ASSIGNED → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED