Closed Bug 686025 Opened 13 years ago Closed 13 years ago

nsNavHistory::AsyncExecuteLegacyQueries uses synchronous createStatement call instead of async createAsyncStatement call, blocks main thread

Categories

(Toolkit :: Places, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla10
Tracking Status
firefox8 + fixed
firefox9 + fixed

People

(Reporter: blizzard, Assigned: mak)

References

(Blocks 1 open bug)

Details

(Whiteboard: [qa-])

Attachments

(6 files, 2 obsolete files)

I got a new laptop from our IT group. It has Windows 7 (64-bit) on it and I'm using it with the default OS that came on it. I installed Firefox 6 (release) on it, set it up with my Sync account, and since then I've been getting occasional hangs. These hangs:
1. Happen pretty regularly, every 5-20 minutes.
2. During that time the browser UI is unresponsive for 20-60 seconds.
3. Firefox pegs one CPU while the browser is unresponsive.
4. It does eventually come back.
I've seen it do this reliably with no plugins enabled and no extensions enabled. I finally got a working x86 debugger installed. Stack traces all look like something is looking at my session history.
Oh, I've seen this with Sync disabled, and I don't think that the sync service does anything if it's turned off.
(In reply to Christopher Blizzard (:blizzard) from comment #1) > Oh, I've seen this with Sync disabled, and I don't think that the sync > service does anything if it's turned off. Correct. The Sync code isn't loaded at all if you don't have Sync configured.
Do you use a 32-bit FF or a 64-bit one?
Main thread stacks point to nsNavHistory->sqlite -- adding the storage assault squadron
I'm using a 32 bit build on 64 bit windows. (I actually typed "firefox.com" into IE to download it!)
Looking at the stack, we're going through this code in WindowsJumpLists.jsm:

_getHistoryResults:
function WTBLJL__getHistoryResults(aSortingMode, aLimit, aCallback, aScope) {

It looks like each stack is actually in statement preparation....
Chris, I might have seen something similar as well, but I've been utterly unable to find the time to look into it. Do you see the same hangs if you attempt to open up the history menu? For me the problem went away when I disabled sync, blew away my places.sqlite file from my profile, and re-synched. The problem did come back once after doing that, and that's as far as I got. I believe I have a copy of my places.sqlite from when I was seeing this, but again, I have yet to find time to dig in here. At first I thought this was due to a local db corruption problem, but if you're seeing the same thing then I think we have a larger problem at hand here. What little I did get out of looking into this so far was that the main thread was waiting on a sqlite lock while a background thread was doing stuff with the db while holding the same lock, but that was from an optimized build etc so I didn't get a clear picture of what was going on at the time...
bug 679800 could be related
Hmm. The stacks above show exactly what jst saw: the main thread is waiting for a lock while an async statement is running on a background thread. I guess the real question is what that statement on the background thread is.
Chris, can you try with a clean profile and see if you get the same results? Otherwise I suggest we try to get your profile data and try that in another clean win7 platform install.
I have the same issue, I took a stack trace dump as well. So I'll attach it shortly. My system is a windows 7 64-bit one as well, with 32-bit firefox.
Stack trace taken when browser hangs, this one was upon opening a url from outside firefox.
I've been getting permanent hangs and 100% cpu when opening the Page Info window, and occasional hangs for up to 60s when clicking the Back button. Clearing browsing and download history seems to fix it, if that's any help.
Why are we using a mutex to make the main thread wait for prepared (and possibly re-prepared and retried) statements to execute? This seems inherently janky and (in the case of this bug) hang-prone. /be
Thread 20 is holding the mutex, from mozStorage code. What change sets hit ff6 in that area? /be
Brendan, the main thread is trying to prepare a statement every time that jumplist code is called. That preparation is what tries to grab the mutex. If the jumplist code used an already-prepared statement and just ran it async on a background thread it might well not need the mutex on the main thread at all. That said, thread 20 is just doing an async query on the history. The real question is who asked for that query to happen, and as usual with async stuff that information is not captured in the thread stacks....
(In reply to Boris Zbarsky (:bz) from comment #16) > Brendan, the main thread is trying to prepare a statement every time that > jumplist code is called. That preparation is what tries to grab the mutex. Yes, I gathered that much. Did something I wrote say otherwise? The sqlite3_step code active on thread 20 does execute with retry and re-prepare on schema change, though. Do we know that thread is not looping? > If the jumplist code used an already-prepared statement and just ran it > async on a background thread it might well not need the mutex on the main > thread at all. Maybe. The mutex serializes lots of critical sections, though. Finer-grained locking or lock-free methods do look possible. Worth a bug? I don't think you can assume thread 20 is making progress. This could be a livelock. But could it be an infinite loop? /be > > That said, thread 20 is just doing an async query on the history. The real > question is who asked for that query to happen, and as usual with async > stuff that information is not captured in the thread stacks....
(thanks, iPad spelling corrector!) IOW the stack shows thread 20 owning the mutex. That could be significant, not incidental. The number of mutex-guarded critical sections, and their apparent lengths and data at stake, in sqlite3 is alarming. /be
Johnny, I can open the history menu without problems. Is there a way for me to see all sqlite operations as they are executed, either dumped to the error console or stderr?
(In reply to Christopher Blizzard (:blizzard) from comment #19)
> Is there a way for me to see all sqlite operations as they are executed,
> either dumped to the error console or stderr?

export NSPR_LOG_MODULES=mozStorage:5

^ will send queries to stdout
nsNavHistory::AsyncExecuteLegacyQueries is using createStatement when it looks like it wants to be using createAsyncStatement (see http://mxr.mozilla.org/mozilla-central/source/storage/public/mozIStorageConnection.idl#182 for the right signature), unless I am missing something. Bad usage: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/nsNavHistory.cpp#5188

createAsyncStatement does not acquire a mutex on the main thread because it defers the actual construction of the statement to the asynchronous thread.

Mutex-wise, yes, each SQLite connection is protected by a single mutex. All of our asynchronous operations are designed to make sure that mutex is only acquired on the asynchronous thread. Unfortunately, the synchronous statements can be used both ways, so it's not immediately obvious if you choose the wrong one. Regrettably the IDL does not seem to make the danger clear on this front, although I think the MDN docs may do so?

In the storage component, we have a C++ unit test to make sure code does not accidentally acquire a mutex on the main thread when it should not. Perhaps Places should adopt a similar unit test to make sure that it avoids regressing its mutex behavior.
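To make the distinction above concrete, here is a minimal chrome-JS sketch of the two creation paths. This is not the actual Places code: the query, the `db` variable and the callback bodies are invented for illustration; the real offending call is the C++ one in nsNavHistory.cpp linked above.

// Illustration only -- not the actual Places code. Assumes a privileged
// (chrome) script with an already-open mozIStorageConnection named `db`.

// Synchronous creation: the statement is prepared right away on the calling
// thread, which must take the SQLite connection mutex. If the async thread
// is in the middle of a long-running statement, the main thread blocks here.
let syncStmt = db.createStatement(
  "SELECT url FROM moz_places ORDER BY last_visit_date DESC LIMIT :limit");

// Asynchronous creation: preparation is deferred to the connection's async
// thread, so no mutex is acquired on the main thread at creation time.
let asyncStmt = db.createAsyncStatement(
  "SELECT url FROM moz_places ORDER BY last_visit_date DESC LIMIT :limit");
asyncStmt.params.limit = 10;
asyncStmt.executeAsync({
  handleResult: function(aResultSet) {
    for (let row = aResultSet.getNextRow(); row; row = aResultSet.getNextRow())
      dump("url: " + row.getResultByName("url") + "\n");
  },
  handleError: function(aError) {
    Components.utils.reportError("query failed: " + aError.message);
  },
  handleCompletion: function(aReason) {
    // aReason is REASON_FINISHED (0) unless the statement errored or was canceled.
  }
});
asyncStmt.finalize();
syncStmt.finalize();

Note that both kinds of statement can be executed with executeAsync(); the difference that matters for this bug is where the prepare itself happens.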
Component: History: Global → Places
Product: Core → Toolkit
QA Contact: history.global → places
Summary: hangs every few minutes with Firefox 6 → nsNavHistory::AsyncExecuteLegacyQueries uses synchronous createStatement call instead of async createAsyncStatement call, blocks main thread
Version: 6 Branch → Trunk
Note that the hangs happen consistently every 10 minutes. I can often time it. Someone said to me that there's some sql maintenance thread that wakes up every 10 minutes or so.
jst is experiencing a very similar problem where a database will hang for minutes. Apart from the incorrect createStatement call, we need to understand why a database operation off the main thread is taking so long.
(In reply to Dietrich Ayala (:dietrich) from comment #20)
> export NSPR_LOG_MODULES=mozStorage:5
>
> ^ will send queries to stdout

If I do this in a cmd prompt on my Windows machine:

set NSPR_LOG_FILE=%TEMP%\log.txt
set NSPR_LOG_MODULES=mozStorage:5

and then start Firefox from that same prompt, a log.txt file is created, but it doesn't get any content... So, is that actually the way I'm supposed to get a log of all sqlite statements?
The problem here is that Places (along with lots of other places in our code) still creates and uses synchronous database statements. SQLite is only threadsafe because it serializes all access to a database connection (this is unlikely to change any time soon). To make this problem worse, every time we "fix" an area that does this, we end up putting more work on the background thread, which increases the likelihood that the remaining places that need to acquire the mutex on the main thread will encounter contention for it. Some of you might recall we hit this problem in the run-up to Firefox 4 as well (November/December of last year), and it caused Marco and me to have to re-architect a bunch of stuff in Places for a few months in order to work around it. Until Firefox removes all uses of the synchronous Storage API from the main thread, this issue will keep rearing its head. (When that happens we can actually use SQLite in a way that stops using mutexes, which will likely speed it up too.)
What do we need to do to diagnose the particular "hang every 5 or 10 minutes" issue that blizzard and jst are seeing? Would breaking into the storage thread to see what statement is taking "forever" help diagnose the problem? In attachment 559589 [details] that would presumably be something in the stack frame of mozilla::storage::AsyncExecuteStatements::bindExecuteAndProcessStatement. What member gives us the SQL statement that is actually going to run?
I'm running with a debug build to try to catch the query.
It may be jumplists, expiration, or Sync; there's a lot of work done at intervals. Most of it is off the main thread, but that's also a recipe for contention, as explained in comment 25. We keep working to move all the stuff to the async thread, but it's a long and tedious process that breaks all of our API consumers. Some consumers matter more, like mobile and Sync, and we have worked mostly in their direction when changing APIs. As a first step, I'll audit all of our usage of CreateStatement to check which calls may be immediately converted to CreateAsyncStatement, and see if there's any other statement that can easily be moved. It's certainly possible some query is behaving slowly with some specific database contents, so knowing the query may help to further improve it.
Attached patch patch v1.0 (obsolete) (deleted) — Splinter Review
This is a first pass that I still want to triple-check. There are various places in the codebase using createStatement when they could use the async version, and it's worth fixing the easy/safe ones:
- CookieService (done)
- some tests in Storage (done; only a couple I wrote, since some _want_ to use createStatement)
- nsNavHistory (done)
- AddonRepository always executes asynchronously (done, with a bit of renaming for clarity)
- XPIProvider uses most statements both sync and async (won't touch it)
- nsContentPrefService.js uses most statements both sync and async (won't touch it)
Dave, would my suggested changes to AddonRepository be fine? I'm wondering whether we could print out a warning when we destroy a statement that was created with createStatement but was never executed synchronously. Looks like this may be trackable with a debug property on the statement?
Assignee: nobody → mak77
Status: NEW → ASSIGNED
Attachment #561203 - Flags: feedback?(dtownsend)
mak - should we throw this against try and give blizzard a test build?
(In reply to Andrew Sutherland (:asuth) from comment #21) > In the storage component, we have a C++ unit test to make sure code does not > accidentally acquire a mutex on the main thread when it should not. Perhaps > places should adopt a similar unit test to make sure that it avoids > regressing its mutex behavior. More radical idea: remove the synchronous API entirely in the version we link with? /be
I'm going to try and apply this patch to my Firefox 6 tree by hand and see if it makes a difference. I suspect it will hide an underlying problem, but that's probably for another bug.
(In reply to Christopher Blizzard (:blizzard) from comment #32) > I suspect it will hide an underlying problem, but that's probably for > another bug. It's not just a suspicion. The underlying problem is that Places is using both synchronous and asynchronous Storage APIs (unfortunately this happens for other components too), and that by itself creates contention. The solution is to re-architect Places to be completely asynchronous, which has been a work in progress for quite some time (starting from when autocomplete was rewritten, through the fsync reduction, the use of WAL, favicons, visits and Sync made mostly async, and so on...). The patch by itself will just reduce some possibilities of contention (among them the above reported stack), but it cannot solve an architectural problem.
I don't think any of us are arguing about the general problem which needs to be fixed. But the specific problem of users hanging for minutes is relatively new. We really need to understand the specific query which is taking so long, in addition to this patch and the longer-term work.
(In reply to Christopher Blizzard (:blizzard) from comment #32) > I'm going to try and apply this patch to my Firefox 6 tree by hand and see > if it makes a difference. I'll get you a try build; the above patch is conceptually fine, but after a clobber I noticed it doesn't compile (oops), so I'm fixing it and pushing it to try.
Attachment #561576 - Flags: review?(dtownsend)
> It's not a suspect. The underlying problem is that Places is using both synchronous and asynchronous > Storage APIs (unfortunately this happens for other components too), and that by itself creates > contention. I'm confused. Supposing the sync API didn't exist, wouldn't this mystery long-running query still block all other async queries? So instead of locking up the whole UI, we'd see things like the awesomebar and maybe favicons not working for minutes at a time?
I still get the hang running with a build with this patch anyway. Just caught it in the debugger again. So it doesn't do much for the problem I'm having. Not sure if this is relevant information or not, but the sql trace running had this right before the hang started:

2864[a829098]: sqlite3_trace on a64a2d8 for 'BEGIN IMMEDIATE'
2864[a829098]: sqlite3_trace on a64a2d8 for 'UPDATE moz_cookies SET lastAccessed = 1316669137394000 WHERE name = 'SSID' AND host = '.mozilla.com' AND path = '/' '
2864[a829098]: sqlite3_trace on a64a2d8 for 'UPDATE moz_cookies SET lastAccessed = 1316669137394000 WHERE name = 'SSRT' AND host = '.mozilla.com' AND path = '/' '
2864[a829098]: sqlite3_trace on a64a2d8 for 'UPDATE moz_cookies SET lastAccessed = 1316669137394000 WHERE name = 'wtspl' AND host = '.mozilla.com' AND path = '/' '
2864[a829098]: sqlite3_trace on a64a2d8 for 'UPDATE moz_cookies SET lastAccessed = 1316669137394000 WHERE name = '__qca' AND host = '.mozilla.com' AND path = '/' '
2864[a829098]: sqlite3_trace on a64a2d8 for 'UPDATE moz_cookies SET lastAccessed = 1316669137394000 WHERE name = 'WT_FPC' AND host = '.mozilla.com' AND path = '/' '
2864[a829098]: sqlite3_trace on a64a2d8 for 'COMMIT TRANSACTION'
0[20c6a0]: sqlite3_trace on 59f9910 for 'UPDATE moz_hosts SET permission = 1, expireType= 2, expireTime = 1316755537556 WHERE id = 28'
0[20c6a0]: Resetting statement: 'UPDATE moz_hosts SET permission = ?2, expireType= ?3, expireTime = ?4 WHERE id = ?1'
(In reply to Justin Lebar [:jlebar] from comment #37) > I'm confused. Supposing the sync API didn't exist, wouldn't this mystery > long-running query still block all other async queries? So instead of > locking up the whole UI, we'd see things like the awesomebar and maybe > favicons not working for minutes at a time? It depends, the awesomebar uses a separate connection and thanks to the WAL it can read at any time even if any other connection is writing. We have 4 connections exactly to improve responsiveness.
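To make the point about WAL concrete, here is a rough chrome-JS sketch of what "separate connection + WAL" buys. The file name and table are invented (a scratch database, so as not to touch the live places.sqlite), and the exact pragmas and connections Places uses differ.

// Illustration only; file name, table and query are invented. Assumes chrome JS.
Components.utils.import("resource://gre/modules/FileUtils.jsm");
let storage = Components.classes["@mozilla.org/storage/service;1"]
                        .getService(Components.interfaces.mozIStorageService);

let file = FileUtils.getFile("ProfD", ["wal-example.sqlite"]);

// Writer connection: switch the database to WAL so that readers on other
// connections are not blocked while a write transaction is in progress.
let writer = storage.openUnsharedDatabase(file);
writer.executeSimpleSQL("PRAGMA journal_mode = WAL");
writer.executeSimpleSQL("CREATE TABLE IF NOT EXISTS t (v INTEGER)");

// Reader connection: a second connection to the same file. With WAL it can
// keep answering SELECTs (think: the awesomebar) even while `writer` is in
// the middle of a transaction; with the default rollback journal it would
// have to wait.
let reader = storage.openUnsharedDatabase(file);
let stmt = reader.createAsyncStatement("SELECT COUNT(*) FROM t");
stmt.executeAsync({
  handleResult: function(aResultSet) {},
  handleError: function(aError) {},
  handleCompletion: function(aReason) {}
});
stmt.finalize();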
(In reply to Christopher Blizzard (:blizzard) from comment #38) > I still get the hang running with a build with this patch anyway. Just > caught it in the debugger again. So it doesn't do much for the problem I'm > having. Any possibility I may privately get a copy of the profile you are using for these tests?
(In reply to Christopher Blizzard (:blizzard) from comment #38)
> 2864[a829098]: sqlite3_trace on a64a2d8 for 'BEGIN IMMEDIATE'
> 0[20c6a0]: Resetting statement: 'UPDATE moz_hosts SET permission = ?2, expireType= ?3, expireTime = ?4 WHERE id = ?1'
I don't think these are related: the cookie operations are completely async there, while the permission manager is completely sync, but it is doing a simple update and has synchronous=off, so it should be quite fast since it doesn't care about durability. Do you have the queries logged just after these? What was the stack? I don't think it can be the same as before, since the patch should at least fix that specific case.
Comment on attachment 561576 [details] [diff] [review] patch v1.1 r=asuth for teeny storage bit
Attachment #561576 - Flags: review?(bugmail) → review+
I didn't really want to share my profile but if it would help I can try to put it somewhere. Also I've been running the 7 beta all day and I've seen no hangs so far.
(In reply to Christopher Blizzard (:blizzard) from comment #43) > Also I've been running the 7 beta all day and I've seen no hangs so far. I just installed 7b06, but I did get a hang after a while. Similar in length and behavior, so I doubt it's really fixed in 7.
I've had it happen in 7 now, but it's far less than it was in 6. Also? Turns out this happens on my wife's machine. She has a mac, with her own profile. But it does beachball at 10 minute intervals. Not always every 10 minutes, but definitely on the interval.
(In reply to Marco Bonardo [:mak] from comment #39) > (In reply to Justin Lebar [:jlebar] from comment #37) > > I'm confused. Supposing the sync API didn't exist, wouldn't this mystery > > long-running query still block all other async queries? So instead of > > locking up the whole UI, we'd see things like the awesomebar and maybe > > favicons not working for minutes at a time? > > It depends, the awesomebar uses a separate connection and thanks to the WAL > it can read at any time even if any other connection is writing. We have 4 > connections exactly to improve responsiveness. So if database reads aren't blocked by database writes, why exactly are we blocking here? Does the API we use impose additional synchronization on top of that provided by the database? If so, why? I understand that we should be using async APIs everywhere, but I don't understand why access to the database should be blocked for this long.
(In reply to Christopher Blizzard (:blizzard) from comment #45) > I've had it happen in 7 now, but it's far less than it was in 6. Then you're lucky; I don't see much difference with my 7 vs what I experienced in 6. I.e. 7 hangs as often as 6 did. (In reply to Justin Lebar [:jlebar] from comment #46) > So if database reads aren't blocked by database writes, why exactly are we > blocking here? Does the API we use impose additional synchronization on top > of that provided by the database? If so, why? > > I understand that we should be using async APIs everywhere, but I don't > understand why access to the database should be blocked for this long. Given that a complete rewrite ('vacuum' from the sqlite prompt) of places.sqlite only takes a second or so, I don't even understand how it would be possible to have database operations that consume this much time. Unless there is some periodic job (like the vacuum) that can only execute if all connections are closed. But those connections are pooled and only time out after several seconds of non-usage... But as I still have no idea how to make Firefox log the SQL operations on my end, I won't be able to help out with that much. By the way, while typing this text Firefox hung on me again; it took about 3 minutes for it to respond again...
I've monitored it a bit more; it turns out it takes about 9 minutes (not 10) to re-occur and then lasts about 3 minutes. I.e. it lasted from 11:25-11:28, 11:34-11:37, 11:43-11:46 and so on. So basically, one third of the time, Firefox is unresponsive :X
(In reply to Arjen from comment #47) > But as I still have no idea how to make Firefox log the sql-operations on my > end, I won't be able to help out with that much. You need a debug build or to force PR_LOGGING to be defined during the build of the storage component; then the logging environment variables you used before will have an effect. Yes, it would be possible to make the logging always built-in with minimal overhead with a patch. Right now defining PR_LOGGING causes the sqlite3_trace mechanism to be invoked unconditionally. I think you can find a debug build here (picking a recent dir in the subdir tree); change up the path if it's not the right thing: https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32-debug/
(In reply to Andrew Sutherland (:asuth) from comment #49) > I think you can find a debug build here (picking a recent dir in the subdir > tree); change up the path if it's not the right thing: > https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla- > central-win32-debug/ Thanks for the pointer, but those builds require a debug version of the VC80 redistributable as well. And that seems to require a full Visual Studio install, and I'd rather not install environments like that on my personal desktop computer... Is there no other way to get the query logging enabled?
Comment on attachment 561576 [details] [diff] [review] patch v1.1 Review of attachment 561576 [details] [diff] [review]: ----------------------------------------------------------------- r=sdwilsh on everything but the add-on manager stuff.
Attachment #561576 - Flags: review+
(In reply to Justin Lebar [:jlebar] from comment #46) > So if database reads aren't blocked by database writes, why exactly are we > blocking here? In the specific reported stack it is a mutex protecting thread safety in Storage. > I understand that we should be using async APIs everywhere, but I don't > understand why access to the database should be blocked for this long. It should not, thus the bug. No query should take more than a few milliseconds; they have been built with that in mind. Broken data (like a corrupt index or a missing ANALYZE run) or an unexpected data distribution may cause that to fail, though. My problem is that I need a profile where this happens, since I've never seen a hang even in the large, hundreds-of-MB databases I have for testing, nor in my default profile that is well over 100MB of data. There must be a specific case, due to the data distribution or broken data, that activates the behavior. So anyone able to reproduce these hangs, please, I really need your profile, or at least your places.sqlite database (provided it ends up being Places), by some private means (either mail it or mail a link to download a compressed and encrypted copy). I also have some additional questions that may help me:
- does Nightly improve or kill these hangs?
- could you give me the stats generated by this add-on (flag only "statistics"): https://addons.mozilla.org/en-US/firefox/addon/places-maintenance/
Here are my stats:

Statistics
- Database size is 20480 KiB
- user_version is 11
- page_size is 4096
- cache_size is 124393
- journal_mode is wal
- synchronous is 1
- History can store a maximum of 254758 unique pages
- Table moz_bookmarks has 192 records
- Table moz_bookmarks_roots has 5 records
- Table moz_keywords has 0 records
- Table sqlite_sequence has 0 records
- Table moz_favicons has 1125 records
- Table moz_annos has 154 records
- Table moz_anno_attributes has 8 records
- Table moz_items_annos has 64 records
- Table moz_places has 15647 records
- Table moz_historyvisits has 156543 records
- Table moz_inputhistory has 126 records
- Table sqlite_stat1 has 9 records
- Indexes: sqlite_autoindex_moz_bookmarks_roots_1, sqlite_autoindex_moz_keywords_1, sqlite_autoindex_moz_favicons_1, sqlite_autoindex_moz_anno_attributes_1, sqlite_autoindex_moz_inputhistory_1, moz_bookmarks_itemindex, moz_bookmarks_parentindex, moz_bookmarks_itemlastmodifiedindex, moz_places_faviconindex, moz_places_hostindex, moz_places_visitcount, moz_places_frecencyindex, moz_historyvisits_placedateindex, moz_historyvisits_fromindex, moz_historyvisits_dateindex, moz_places_lastvisitdateindex, moz_annos_placeattributeindex, moz_items_annos_itemattributeindex, moz_places_url_uniqueindex, moz_bookmarks_guid_uniqueindex, moz_places_guid_uniqueindex
- Trigger: moz_bookmarks_beforedelete_v1_trigger

I also ran the integrity and coherence checks, and all seems well. I'll check the Nightly shortly.
(In reply to Arjen from comment #50) > Is there no other way to get the query logging enabled? I spun a try build wherein I think I enabled PR_LOGGING in mozStorage: https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bugmail@asutherland.org-6edd9b6f035f/try-win32/ It should be very similar to a nightly, too...
Attached file Places stats (deleted) —
These are the statistics from the addon on my dad's Win XP computer. He's been seeing this hang for a few weeks now. It happens even with sync off. Can we generate a build which doesn't require the MSVC debug libraries which will dump some information when the db is locked for more than X seconds? > My problem is that I need a profile where this happens I'll try to clean the private data out of my dad's profile and send that to you. > in the specific reported stack is a mutex protecting thread safety in Storage. Is this mutex held while accessing the database? Shouldn't we drop all locks before querying the DB, since the DB does its own locking? Otherwise we've defeated all the reads-proceed-concurrently-with-writes business sqlite is trying to give us.
(In reply to Andrew Sutherland (:asuth) from comment #54) > (In reply to Arjen from comment #50) > > Is there no other way to get the query logging enabled? > > I spun a try build wherein I think I enabled PR_LOGGING in mozStorage: It looks like I screwed this up. I'll check with a local build and then spin another (optimized) try build. (In reply to Justin Lebar [:jlebar] from comment #55) > Is this mutex held while accessing the database? Shouldn't we drop all > locks before querying the DB, since the DB does its own locking? Otherwise > we've defeated all the reads-proceed-concurrently-with-writes business > sqlite is trying to give us. Storage actually uses the SQLite database connection mutex: http://mxr.mozilla.org/mozilla-central/source/storage/src/SQLiteMutex.h The upside is that in cases where the mutex will be rapidly acquired and released in a cycle, we can avoid a lot of churn/contention by just holding it for a longer aggregate period (and thereby potentially reducing the introduction of new types of exciting races).
(In reply to Marco Bonardo [:mak] from comment #52) > (In reply to Justin Lebar [:jlebar] from comment #46) > > So if database reads aren't blocked by database writes, why exactly are we > > blocking here? > > in the specific reported stack is a mutex protecting thread safety in > Storage. What mutex is that, can you give some mxr links? Why does it need to be held across i/o? That should be avoided, always. /be
By the way, last night's Nightly "9.0a1 (2011-09-26)" didn't improve things. If anything, it may even have a slightly faster recurrence (8-8.5 minutes rather than the 9 minutes I mentioned for 7b06) and a slightly longer period of hangs (a bit more than 3 minutes). But that may also be just my impatience while waiting for the timing to finish :-)
People experiencing the hang bug, are you using any type of anti-virus? If so, which one? Also, a proper try build for win32 claims to be 86 minutes away from finishing baking. It will be at https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bugmail@asutherland.org-03ac75ee927f/ when done
(In reply to Andrew Sutherland (:asuth) from comment #59) > People experiencing the hang bug, are you using any type of anti-virus? If > so, which one? Also, a proper try build for win32 claims to be 86 minutes > away from finishing baking. It will be at > https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bugmail@asutherland.org-03ac75ee927f/ when done

Okay, the optimized try build is now up here:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bugmail@asutherland.org-03ac75ee927f/try-win32/

I have confirmed that doing:

set NSPR_LOG_FILE=%TEMP%\log.txt
set NSPR_LOG_MODULES=mozStorage:5

then running it works.
Two things terrify me about this bug, and neither of them has to do with Places:
1. Our test infrastructure doesn't catch this. We are clearly not exercising the code during tests the way we exercise it in production. This is something we have to get on top of right away. I suggest we pair up one of Bob's amazing automation wizards with someone who knows Places and we come up with a stress test that exposes this issue during a test run with the code as-is.
2. We have no visibility into hangs out in the field. Looks like this bug hits users without extensions every 10 minutes (according to Blizzard). And we don't know about it! Without visibility we can't prioritize and triage these bugs (sounds like we knew about this problem, we just didn't know it was this widespread). We need a bug for an automated watchdog, and we need it fixed asap. If I can't find one, I will file a new one.
Depends on: hang-detector
Mak, if you find me tomorrow I can get you a .zip of my profile, which hangs reliably on windows with Firefox 6.0.2.
(In reply to Andrew Sutherland (:asuth) from comment #60) > Okay, the optimized try build is now up here: > https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/ > bugmail@asutherland.org-03ac75ee927f/try-win32/ > > I have confirmed that doing: > set NSPR_LOG_FILE=%TEMP%\log.txt > set NSPR_LOG_MODULES=mozStorage:5 > > then running it works. And indeed it works just fine. Unfortunately, I don't see much activity in the log while it's hanging, and the query just before looks to be doing just fine. But I'm running out of time just now, so I'll have a better look tonight. About virus scanning, I'm running Avira AntiVir Personal; I can try disabling it to see if that helps. Btw, about the log, it might be useful to add some sort of timestamp to it. Right now it just logs the statements with some connection and statement indications (I think), but I'm not certain any of the other aspects is actually useful in relating the executed statements to the times when the hangs occurred. As said, I'll have a longer look at it tonight.
These are 2 test builds on current Nightly where I've modified a couple of things I have suspicions about: the former avoids expiration notifications during handleResults (ideally this should not make a difference, but since we are out of bounds here it's worth a try), the latter additionally doesn't run ANALYZE (which causes reparsing of prepared statements). Both include the patch in this bug. I'd like to know if they make any difference for users able to reproduce the hang.
build 1: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mak77@bonardo.net-506df90a2c98/
build 2: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mak77@bonardo.net-0054738ddc24/
(In reply to Christopher Blizzard (:blizzard) from comment #62) > Mak, if you find me tomorrow I can get you a .zip of my profile, which hangs > reliably on windows with Firefox 6.0.2. I'll be on irc, as usual. (In reply to Andreas Gal :gal from comment #61) > 1. Our test infrastructure doesn't catch this. We are clearly not exercising > the code during tests the way we exercise it in production. We don't know the cause yet, and it affects only some users. If it's hard to reproduce, it's also hard to make a test for it. Reproducing the production conditions may be nearly impossible considering all the possible hardware/software/profile configurations of the users, and those matter. But for sure the infrastructure is missing a real browsing-without-hangs test. > 2. We have no visibility into hangs out in the field. Looks like this bug > hits users without extensions every 10 minutes (according to Blizzard). And > we don't know about it! Without visibility we can't prioritize and triage > these bugs (sounds like we knew about this problem, we just didn't know it > was this widespread). We still don't know; we have 3 or 4 reports here, and they may even be due to different causes (for example the first stack is due to a wrong CreateStatement call, but it looks like that's not the only one). We should report hangs with full stacks like we do with crashes.
> We still don't know, we have 3 or 4 reports here, and they may even be due > to different causes (for example the first stack is due to a wrong > CreateStatement call, but looks like it's not the only one). We should > reports hangs with full stacks like we do with crashes. I attached a patch to bug 429592 to do that
(In reply to Marco Bonardo [:mak] from comment #65) > We still don't know, we have 3 or 4 reports here, and they may even be due > to different causes (for example the first stack is due to a wrong > CreateStatement call, but looks like it's not the only one). We should > reports hangs with full stacks like we do with crashes. Not everyone knows about Bugzilla; for instance, this support question may very well be the same issue: http://support.mozilla.com/nl/questions/876938 And there are 59 others who claim it affects them as well. Here's one with 580 users: http://support.mozilla.com/nl/questions/872283 And one with 162: http://support.mozilla.com/nl/questions/872360 If you search more in the support section, you'd probably find more reports that appear to be similar.
(In reply to Arjen from comment #67) > Not every one knows about bugzilla, for instance, this support-question may > very well be the same issue: No doubt, as it may be a completely different issue, unless we consider all browser hangs and freezes due to this bug. My point was that it's hard to relate hangs to reasons, not that we don't have hangs. Btw, anyone willing to test comment 64 builds and send me profiles privately is welcome!
Apart from testing the above 2 builds, could you also try setting places.history.expiration.interval_seconds to 3600 in your current version and see if the hangs disappear? Remember to restore the pref once done.
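If it helps anyone testing, here is a small sketch of flipping that pref from privileged JS (it can equally be changed through about:config); the pref name is the one mentioned just above, and 3600 is only the temporary test value.

// Sketch only; assumes a chrome context (e.g. the Error Console or Scratchpad).
Components.utils.import("resource://gre/modules/Services.jsm");

const PREF = "places.history.expiration.interval_seconds";

// Raise the expiration interval to one hour for the test...
Services.prefs.setIntPref(PREF, 3600);

// ...and once the test is over, restore the default as requested above.
Services.prefs.clearUserPref(PREF);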
Still happening to my dad on WinXP with virus scanner disabled.
(In reply to Marco Bonardo [:mak] from comment #69) > Apart testing the above 2 builds, may you also try setting > places.history.expiration.interval_seconds to 3600 in your current version > and see if hangs disappear? > Remember to restore the pref once done. With this adjusted parameter my Firefox 7b06 was able to browse for at least 36 minutes without any hang. I was too bored to test it any longer, so I didn't wait.
(In reply to Marco Bonardo [:mak] from comment #64) > build 1: > http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mak77@bonardo.net- > 506df90a2c98/ This one seems to fix the problem. Or at least it didn't reoccur for at least half an hour. > build 2: > http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mak77@bonardo.net- > 0054738ddc24/ I haven't tested this one, would you like me to?
Attached file Snippet from SQLite log during hang (deleted) —
I've added a partial log of the statements that were in the log file during one of those hangs (with the build from comment 60). As can be seen at the end of the log, there are several deletes on moz_places, moz_annos and a few more. During a second hang those same queries were run, but the entries were only added to the log right after my browser responded again. So it is probably a good start to investigate. Most of those deletes are fairly complicated and - although I have mainly MySQL and PostgreSQL experience - look to me like probable slow queries. Given the size of the databases and the fact that I have a fairly new and high-end system, I doubt it's the I/O, although some missing index might make things CPU-heavy if all data is already in memory.
I had a look at the query plans of some of those queries using the sqlite3 shell. The first one hit the jackpot. This one:

SELECT h.id, h.url, h.guid, h.last_visit_date, :limit_uris
FROM moz_places h
LEFT JOIN moz_historyvisits v ON h.id = v.place_id
LEFT JOIN moz_bookmarks b ON h.id = b.fk
WHERE v.id IS NULL
  AND b.id IS NULL
  AND h.ROWID <> IFNULL(:null_skips_last, (SELECT MAX(ROWID) FROM moz_places))
LIMIT :limit_uris

(I substituted 100 for ':limit_uris' and 0 for ':null_skips_last') resulted in this plan:

0|0|0|SCAN TABLE moz_places AS h (~81 rows)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE moz_places USING INTEGER PRIMARY KEY (~1 rows)
0|1|1|SCAN TABLE moz_historyvisits AS v (~1 rows)
0|2|2|SEARCH TABLE moz_bookmarks AS b USING COVERING INDEX moz_bookmarks_itemlastmodifiedindex (fk=?) (~2 rows)

As you can see, it expected moz_historyvisits to only have 1 row rather than the actual 150k rows... and thus it chose a rather inefficient table scan instead of a more useful index search. So I ran ANALYZE from the shell, and after that, this plan emerged:

0|0|0|SCAN TABLE moz_places AS h (~81 rows)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE moz_places USING INTEGER PRIMARY KEY (~1 rows)
0|1|1|SEARCH TABLE moz_historyvisits AS v USING COVERING INDEX moz_historyvisits_placedateindex (place_id=?) (~2 rows)
0|2|2|SEARCH TABLE moz_bookmarks AS b USING COVERING INDEX moz_bookmarks_itemlastmodifiedindex (fk=?) (~2 rows)

So now it uses the index, as expected. Unless Firefox manages to mess up those statistics again, I'm fairly certain this fixes not only this specific query but also some of the others used in the deletes from the logs. For now it seems my original Firefox 7b06 doesn't exhibit the behavior anymore now that those table statistics are in order... Perhaps you guys can add periodic ANALYZE runs on tables or databases that see heavy updates (or inserts/deletes). Adding periodic VACUUMs might be a good idea as well (if Firefox doesn't do that already), although I have no idea what kind of performance gain or impact that has.
Obviously the statistics for moz_places were also wrong (there are actually almost 8k records), but the table scan was still the correct plan, so that didn't matter much here.
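For anyone who wants to reproduce Arjen's check above without a standalone sqlite3 shell, here is a rough chrome-JS sketch. The query is a shortened stand-in for the expiration query quoted above, and note that it deliberately uses the synchronous API, so it is only suitable for offline diagnostics on a copy of the database, not for product code.

// Illustration only. Assumes a chrome context with an open connection `db`
// on a copy of places.sqlite.
function dumpPlan(db, sql) {
  // EXPLAIN QUERY PLAN output in this SQLite era has four columns
  // (selectid, order, from, detail); the readable text is the last one.
  let stmt = db.createStatement("EXPLAIN QUERY PLAN " + sql);
  try {
    while (stmt.executeStep())
      dump(stmt.getString(3) + "\n");
  } finally {
    stmt.finalize();
  }
}

const SQL = "SELECT h.id FROM moz_places h " +
            "LEFT JOIN moz_historyvisits v ON h.id = v.place_id " +
            "WHERE v.id IS NULL LIMIT 100";

dumpPlan(db, SQL);               // with stale stats: SCAN TABLE moz_historyvisits
db.executeSimpleSQL("ANALYZE");  // recompute sqlite_stat1
dumpPlan(db, SQL);               // afterwards: SEARCH ... USING COVERING INDEX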
(In reply to Arjen from comment #74) > As you can see it expected moz_historyvisits to only have 1 row, rather than > the actual 150k rows... And thus it choose a rather inefficient table scan, > instead of a more useful index search. Those numbers are not the number of rows in the table, those are estimates about the number of rows that may be left to evaluate. My question is how much time sqlite shell takes to run the query before and after analyze in your case (to simulate the "before" you can DROP sqlite_stat1 and sqlite_stat2 tables). Will do the same experiment here. > So I ran analyze from the shell, and after that, this plan emerged: > Perhaps you guys can add > periodic analyze's to tables or databases that see heavy updates What's interesting is that we already do that. What may differ here is that most likely you ran analyze with a sqlite shell compiled with STAT2, while we just use STAT1 for privacy reasons. Also, I think we don't run it often enough, this is known bug 683876, that I'll immediately prioritize and take. > Adding periodic vacuum's probably might be a good idea as well (if firefox > doesn't do that already) We do already, on idle once a month.
Depends on: 683876
A question: do you usually never leave the browser open unused? The fact that you had broken ANALYZE results makes me think you hardly ever hit idle; we have quite a few idle cleanup tasks.
(In reply to Marco Bonardo [:mak] from comment #76) > Those numbers are not the number of rows in the table, those are estimates > about the number of rows that may be left to evaluate. I understood that, but for a full table scan they're normally (at least in postgres and mysql) somewhat similar. > My question is how much time sqlite shell takes to run the query before and > after analyze in your case (to simulate the "before" you can DROP > sqlite_stat1 and sqlite_stat2 tables). > Will do the same experiment here. That didn't work, but I picked yesterday's version from a backup. The specific select took 280 cpu seconds according to the shell's timer. Wall-clock wise that sounds about right as well. After the analyze, the query dove to just 0.0468 cpu seconds. > What's interesting is that we already do that. What may differ here is that > most likely you ran analyze with a sqlite shell compiled with STAT2, while > we just use STAT1 for privacy reasons. > Also, I think we don't run it often enough, this is known bug 683876, that > I'll immediately prioritize and take. Actually, there was no sqlite3_stat2 or sqlite_stat2, but I can double check if you'd like me to. Apart from that, I've no idea how often the analyze is run, but given that my places-db had a mismatch of about 100x (assuming the initial estimate of 81 and later estimate of 7998 are somewhat close to the actual sizes), I doubt it has run for some time... Still, Firefox 5 had no issues with this behaviour, so either something changed to make those queries worse, or something changed to make it more pronounced. > We do already, on idle once a month. Ok, I did manual vacuum's and those didn't help anyway. So obviously, vacuum doesn't (re)do analyze. (In reply to Marco Bonardo [:mak] from comment #77) > A question. Are you usually never leaving the browser open unused? The fact > you had broken analyze results make me think you hardly hit idle, we have > quite some idle cleanup tasks. I commonly walk away from my PC to watch TV or do other things that take over half an hour without me touching said PC, and if my browser happens to be open, I leave it open. I normally don't close it much during the course of the PC's on-time. So unless something during that 'idle time' prohibits Firefox from actually considering itself idle... the answer would be 'no'.
(In reply to Brendan Eich [:brendan] from comment #57) > (In reply to Marco Bonardo [:mak] from comment #52) > > (In reply to Justin Lebar [:jlebar] from comment #46) > > > So if database reads aren't blocked by database writes, why exactly are we > > > blocking here? > > > > in the specific reported stack is a mutex protecting thread safety in > > Storage. > > What mutex is that, can you give some mxr links? Why does it need to be held > across i/o? That should be avoided, always. Mak, can you provide some links please? It is a bad bug, which should be filed and fixed quickly, to hold a mutex across i/o. /be
That mutex is the sqlite per-connection mutex we looked at earlier. I think the solution here is to not share connections between threads. Then we'll never block on the mutex.
(In reply to Arjen from comment #78) > That didn't work yeah sorry, I forgot you can't drop sqlite_stat# tables, you can delete everything from them though. > , but I picked yesterday's version from a backup. The > specific select took 280 cpu seconds according to the shell's timer. > After the analyze, the query dove to just 0.0468 cpu seconds. ok, that's pretty bad. Any way I could get a compressed copy of that backup database by mail? Otherwise I'd just like to get the content of sqlite_stat1 table before and after analyze. > I doubt it has run for some time... Still, Firefox 5 had no > issues with this behaviour, so either something changed to make those > queries worse, or something changed to make it more pronounced. I took a look at the list of changes in Firefox 6, but nothing jumped to my eyes. > > We do already, on idle once a month. > Ok, I did manual vacuum's and those didn't help anyway. So obviously, vacuum > doesn't (re)do analyze. Vacuum doesn't analyze, that's correct. > (In reply to Marco Bonardo [:mak] from comment #77) > I commonly walk away from my PC to watch TV or do other things that take > over half an hour without me touching said PC So, this is pretty strange, our analyze calls should be invoked during those idles, unless something broke idle or idle expiration. Will check that too. (In reply to Brendan Eich [:brendan] from comment #79) > Mak, can you provide some links please? It is a bad bug, which should be > filed and fixed quickly, to hold a mutex across i/o. As said, Storage just gets the mutex from SQLite (http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageConnection.cpp#537) and uses it internally to avoid having mutexes races. Storage allows 2 concurrent threads on the same connection (sync and async APIs), and uses that single mutex to drive them. The issue may happen when we cross the line between the two APIs. Asuth explained this really well here and in bug 689229. At least now we have identified a stack that may cause a contention (with patch here) and the query that is wrongly slow, we still have to identify the reason analyze fails for some users (bug 683876). I agree it's scary that a long query may hang the ui so badly, but it's probably scarier that we don't detect these cases in the wild.
(In reply to Marco Bonardo [:mak] from comment #81) > As said, Storage just gets the mutex from SQLite > (http://mxr.mozilla.org/mozilla-central/source/storage/src/ > mozStorageConnection.cpp#537) and uses it internally to avoid having mutexes > races. What races? What invariants does this mutex protect? > Storage allows 2 concurrent threads on the same connection (sync and > async APIs), and uses that single mutex to drive them. The issue may happen > when we cross the line between the two APIs. Asuth explained this really > well here and in bug 689229. The issue is a bug that needs to be fixed. Even if the line is not crossed, holding a mutex across i/o is a bug on its face. Is it on file? /be
(In reply to Brendan Eich [:brendan] from comment #82) > What races? What invariants does this mutex protect? Note that this is the same mutex SQLite uses to serialize access to the connection.

http://mxr.mozilla.org/mozilla-central/ident?i=sharedDBMutex
Used to:
- protect our "in transaction" state from changing without us being aware of it
- protect `Connection::mFunctions` from being accessed (read/write) on different threads
- protect `Connection::mProgressHandler` from being accessed (read/write) on different threads

http://mxr.mozilla.org/mozilla-central/ident?i=mDBMutex
While executing an asynchronous statement, it is used to:
- prevent `::sqlite3_errmsg` from changing while we step on a statement because the main thread tried to use the connection between us calling step and getting the error. This is only slightly expanding the time in which we hold the lock from SQLite.

> The issue is a bug that needs to be fixed. Even if the line is not crossed, holding a mutex across i/o is a bug on its face. Is it on file? No, and like I said before [run-up to Firefox 3.6, run-up to Firefox 4.0 (bug 563538 + dependencies), now (bug 689229 comment 16)], this is only a problem because of SQLite's longstanding serialization of access to the database and how we use it. We can't just make everything async at once (we've been working on Places since 3.6 and it still isn't done), so we end up in the state we are in.
Blocks: 666128
I'd love to know when this bug gets fixed!
I've added the contents of my sqlite_stat1, before and after the analyze. I do recall I cleared my history a while back (I think somewhere in July, i.e. with 5.0 or so), so that might account for some of the emptiness, but still, it appears as if the tables have not been analyzed since that moment.
(In reply to Arjen from comment #86) > Created attachment 563098 [details] > Contents of sqlite_stat1 before and after Analyze fwiw, this is REALLY nonsense: INSERT INTO "sqlite_stat1" VALUES('moz_historyvisits',NULL,'0');
May you execute a "PRAGMA integrity_check" on your database? I wonder if analyze fails when indexes go corrupt.
So, after playing a bit with Firefox 6 and 7, I think I have a better understanding of why ANALYZE didn't run often enough. Practically, on clear history we run ANALYZE, which stores low values in the stats table, since most indexes are small at that point. Expiration is then really conservative, for battery saving and to avoid unneeded work: if it evaluates the database as clean (not enough entries to expire, which is pretty common until the user reaches the history limits), it doesn't enable aggressive expiration steps, including the one on idle. So the database grows slowly towards the history limits, without further ANALYZE runs, and at a certain point the stats are so broken that queries take the wrong path. I think the right approach is to avoid ANALYZE completely and rather use pre-calculated statistics, as implemented in bug 683876. We should backport some fix at least to Firefox 8 and Firefox 9; the patch in this bug may be reduced to fix the known jumplists case for the stable branches. Backporting bug 683876 as well would be nice, but if that's too scary I may at least add an effective ANALYZE run on idle.
(In reply to Marco Bonardo [:mak] from comment #89) > Backporting also bug 683876 would be nice, but if too scary I may > at last add an effective analyze run on idle. Actually taking bug 690354 on other branches may be easier. So to clarify, on central I'd take this bug, bug 683876 and bug 690354, on aurora and beta I'd take a reduced version of this bug (only createStatement change) and bug 690354.
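To illustrate the "pre-calculated statistics" idea mentioned in comment 89, here is a rough sketch; the numbers are invented and are not what bug 683876 actually ships, and `db` is assumed to be an open connection on the database in question.

// Illustration only -- invented numbers, not the values bug 683876 uses.
// sqlite_stat1 rows are (tbl, idx, stat): "stat" is an estimated row count
// followed by the average rows matched per distinct prefix of the index's
// columns. Writing them by hand gives the query planner sane sizes without
// ever having to run ANALYZE.
db.executeSimpleSQL(
  "DELETE FROM sqlite_stat1 WHERE tbl = 'moz_historyvisits'");
db.executeSimpleSQL(
  "INSERT INTO sqlite_stat1 (tbl, idx, stat) VALUES " +
  "('moz_historyvisits', 'moz_historyvisits_placedateindex', '100000 2 1')");
// Depending on the SQLite version, hand-written statistics may only be picked
// up when the connection is (re)opened, so the values should be written early.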
I'm guessing I don't need to check anything else anymore? It's still weird this was so much more visible during Firefox 6 usage compared to 5, although perhaps the worsening staleness of the stats data just happened to coincide with starting to use Firefox 6... I think it's a bit scary to force-feed a database some bogus stats, but I can imagine that in this case the number of possible query plans is relatively small and there are normally not many plans that are actually much worse if the stats overstate a table's number of records. At least not at the multi-MB sizes Firefox is operating with. Still, a periodic ANALYZE would probably be best, but if you can't guarantee the periodic aspect... this may be the more defensive solution.
I've experienced this problem since Firefox 5 at least - maybe 4.
(In reply to Arjen from comment #91) > I'm guessing I don't need to check anything else anymore? Right, I figured it out. Thank you for your availability. > Its still weird this was so much more visible during Firefox 6 usage > compared to 5 Imo, it's just a problem that piles up slowly.
We decided we need to track this for FF8 and FF9. It looks like we will get a fix soon.
Hi, a couple of notes based on re-reading the bug. Mak - please make sure you add a telemetry probe that tells us how often we actually analyze the database. Or, if it's very rare, at least how often we go idle. I remember that a lot of our GC problems were related to the fact that the browser actually does idle only very rarely. I don't know if the SQLite code has a different way to measure "idle" or not, but we should make sure that we're actually hitting the condition on a regular basis. If there are a lot of other services that depend on that condition being hit we need to start measuring in the field now now now. Second, think about what happened to me. I set up a new computer, created a new profile (which set up the database with the wrong stats data) and then ran sync which populated the database. This means that I get the hangs because we probably don't analyze the table after we populate it with a huge amount of data! This also means that anyone else that does this is probably affected, which is very scary. I don't know how many people are doing what I'm doing, but I can't imagine it's a small number. If we figure out that Sync is causing terrible performance problems because of this interaction then we really need to make sure we re-analyze after sync finishes the initial data load.
Mak, if I select All in the Places Maintenance database does it re-analyze the tables in question or not?
(In reply to Christopher Blizzard (:blizzard) from comment #95) > Mak - please make sure you add a telemetry probe that tells us how often we > actually analyze the database. Or, if it's very rare, at least how often we > go idle. I don't want to run ANALYZE at all in the future, as implemented in bug 683876, since it's just too scary to forget to do it. Regarding idle, telemetry is sent on idle, so if a user pings he is idle; if a user doesn't ping we can't know whether he opted out of telemetry or just doesn't go idle. We could measure the number of times a user idles over a day, but given how telemetry is designed (it pings all statistics collected up to the first idle) this requires storing data locally, which is something telemetry has not been built for yet (the telemetry privacy document states collected data is volatile). > I remember that a lot of our GC problems were related to the fact > that the browser actually does idle only very rarely. Well, it depends how rarely; once a day may be enough for Places' needs. I rather think we may want something different, maybe a system service, to run maintenance on the profile even when the browser is closed. That may be more reliable than idle. > This also means that anyone else that does this is probably > affected, which is very scary. Agreed, this is scary; that's why I don't want to analyze anymore. (In reply to Christopher Blizzard (:blizzard) from comment #96) > Mak, if I select All in the Places Maintenance database does it re-analyze > the tables in question or not? Selecting the "expire" option is enough.
(In reply to Shawn Wilsher :sdwilsh from comment #83) > > The issue is a bug that needs to be fixed. Even if the line is not crossed, > > holding a mutex across i/o is a bug on its face. Is it on file? > No, and like I said before [run-up to Firefox 3.6, run-up to Firefox 4.0 > (bug 563538 + dependencies), now (bug 689229 comment 16] this is only a > problem because of SQLite's longstanding serialization of access to the > database and how we use it. We do not need to use the mutexes SQLite abuses just because it exposes them in its API, though. Let's go through the invariants, which I do not understand yet -- please educate me and others reading here. > - protect our "in transaction" state from changing without us being aware of it This does not need a mutex acquisition on the main thread, though. Is this a short critical section? It should be, unless you mean the main thread has to keep this state fixed against mutation from a non-main thread -- that would be bad on its face, janky if not hang-y. > - protect `Connection::mFunctions` from being accessed (read/write) on different threads > - protect `Connection::mProgressHandler` from being accessed (read/write) on different threads These must be short critical sections too. So the problem is that SQLite abuses its mutexes by holding them for potentially very long critical sections, and our code on the main thread then gets hung on the same mutexes. That suggests us avoiding using these mutexes at all. > http://mxr.mozilla.org/mozilla-central/ident?i=mDBMutex > While executing an asynchronous statement, it is used to: > - prevent `::sqlite3_errmsg` from changing while we step on a statement because the main thread tried to use the connection between us calling step and getting the error. This is only slightly expanding the time in which we hold the lock from SQLite. But if SQLite holds the mutex and goes out to lunch, then we can hang, right? > We can't just make everything async at once > (we've been working on Places since 3.6 and it still isn't done), so we end > up in the state we are in. Why can't we put all the nasty sync'ing on background threads we communicate with using our own postMessage-style queuing? Then we will still have problems when SQLite goes out to lunch and the queue fills or a response is needed, but we won't hang all of the UI. I don't see why it takes years to stop abusing SQLite mutexes on the main thread. If we married the SQLite API, it could be there's a lot of grunt work remoting with async message passing. But is it really a multi-year mission? /be
It's really 683876 that should be tracked for Firefox 8, right?
nope, see comment 90
(In reply to Christopher Blizzard (:blizzard) from comment #95) > Hi, a couple of notes based on re-reading the bug. > > If we figure out that Sync is causing terrible performance problems because > of this interaction then we really need to make sure we re-analyze after > sync finishes the initial data load. This is the exact way I set up my browser, and I have hang problems all the time. :-( PS I'm not a Firefox developer, so I don't know if that comment will be much use, but I'd be happy to supply more data if you need it.
Brendan & Shawn, can you guys please take the architecture to .platform? I want to keep this bug about how we fix the bad data that we're creating.
Since I installed the Places Maintenance add-on and ran the expire operation I haven't seen a long hang. Yay!
No longer blocks: 666128
OS: Windows 7 → All
Hardware: x86_64 → All
I am the author of bug 688521. Thought I'd drop by in case feedback is needed and provide some info on my setup. For the record, the issue appears on a new installation of FF 6.0.2 (Windows 7, 64-bit), after it gets Firefox Sync'ed with my sync account. Please note that I do not experience a cycle of 9' as described here; it's something like a 60" freeze followed by 60" of OK operation. See my description in bug 688521 for CPU graphs, the stack when this happens, etc. (In reply to Andrew Sutherland (:asuth) from comment #59) > People experiencing the hang bug, are you using any type of anti-virus? If > so, which one? NOD32 Antivirus, ver. 4.2.something (64-bit) (In reply to Christopher Blizzard (:blizzard) from comment #95) > Second, think about what happened to me. I set up a new computer, created a > new profile (which set up the database with the wrong stats data) and then > ran sync which populated the database. This means that I get the hangs > because we probably don't analyze the table after we populate it with a huge > amount of data! This also means that anyone else that does this is probably > affected, which is very scary. I don't know how many people are doing what > I'm doing, but I can't imagine it's a small number. Not a coder here, but this feels like an on-the-spot observation. It also matches my very own experience and possibly explains why my other, much less powerful Firefox-Sync'ed computers do not exhibit this behavior, whereas the most powerful one, albeit with a new installation, does... I expect to be back on Wednesday, meaning I would have access to the problematic platform only by then. Since I still have 6.0.2 installed, would you like me to perform some test before upgrading to 7? I thought I'd install the Places Maintenance extension and do an expire operation. Or do you want me to produce a debug log when the issue appears? Please advise.
(In reply to Christopher Blizzard (:blizzard) from comment #104) > Since I installed the Places Maintenance add-on and ran the expire operation > I haven't seen a long hang. Yay! That is my experience as well. It feels like a brand new browser. Since three independent installations out of three that I know got affected by this bug, and we don't use Sync or anything weird, I would urge you to get this fix out there as soon as possible.
This has been bugging me for months. Around 50% of the time Firefox locks up and uses 50% CPU time. Fixes itself after a few minutes but only lasts a few minutes before doing it again. I'll add explicit instructions for people like me who don't understand the buzz words. A few hours ago I installed the extension "Places Maintenance": https://addons.mozilla.org/en-US/firefox/addon/places-maintenance/ Opened up the add-ons manager. Found "Places Maintenance". Clicked "options". Selected "All" from the drop down menu. Clicked "Execute". Problem solved.
Blocks: 691146
I have run the 'Expire' option using the Places Maintenance and have also observed no lockups in the last couple of hours. Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Blocks: PlacesJank
since this bug is about some specific stack in createStatement, I filed meta bug 691507 to track work and bring on the discussion. I'll now split the patch here and land the reviewed part, filing a new bug for the addons manager part.
Attached patch patch v1.2 (deleted) — Splinter Review
I've split the add-ons manager bits out to bug 691512, so now I can land this.
Attachment #561576 - Attachment is obsolete: true
Attachment #561576 - Flags: review?(dtownsend)
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
Comment on attachment 564337 [details] [diff] [review] patch v1.2 This is the first piece I'd like to backport to 8 and 9. The change is contained and safe enough imo, just moving some possibly-blocking calls to non-blocking calls. To fix the hangs we'll also need bug 690354 and bug 691509 though.
Attachment #564337 - Flags: approval-mozilla-beta?
Attachment #564337 - Flags: approval-mozilla-aurora?
Comment on attachment 564337 [details] [diff] [review] patch v1.2 Approved for Beta (Firefox 8) and Aurora (Firefox 9). Please land as soon as possible.
Attachment #564337 - Flags: approval-mozilla-beta?
Attachment #564337 - Flags: approval-mozilla-beta+
Attachment #564337 - Flags: approval-mozilla-aurora?
Attachment #564337 - Flags: approval-mozilla-aurora+
Blocks: 679800
Is there something specific QA can do to verify this fix, apart from general Places load testing?
Whiteboard: [qa?]
Not this specific bug, I don't think you can test this specific mutex lockup. What you may check is how Firefox behaves after a clear history followed by a Sync import or setting up a new profile using Sync (or importing large number of bookmarks).
PS: notice the above can't be tested in beta since hangs are not fixed in beta (due to bug 691509 missing approval to land there)
Affects Fennec? I have terrible performance there, and I also used sync on Fennec to bring in my places DB.
It may affect Fennec, but mostly what affects Fennec is an exaggeration in the number of entries. That has been addressed some time ago by making expiration more clever, but it may take time to expire all the redundant entries. I'll file a bug to adapt expiration to be more aggressive when the number of pages goes dangerously over the expected limit. I think jlebar filed a bug some days ago about this.
Based on above comments, I don't think QA will be able to directly verify this fix.
Whiteboard: [qa?] → [qa-]