Closed Bug 1462046 Opened 7 years ago Closed 6 years ago

mozilla::storage::AsyncExecuteStatements::Run spins the cpu at 100% for longer than I'm willing to wait

Categories

(Toolkit :: Places, defect, P1)


Tracking


RESOLVED FIXED
mozilla62
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox60 --- unaffected
firefox61 --- unaffected
firefox62 + fixed

People

(Reporter: jrmuizel, Assigned: mak)

References

Details

(Keywords: regression, Whiteboard: [fxsearch])

I've run into this twice in the last two days. Both times Firefox ended up locking up completely. It has been suggested that this might be caused by bug 1239708.
We have some heavy schema migration work to do, which may take a few minutes. I'm not sure whether it's related to that bug; it's possible, but in that case it would be a one-time thing.
This happened to me again. Is there something I can do to find out what's causing it?
Flags: needinfo?(mak77)
Yes, it was very likely due to the new autofill: we had a few queries taking too long on the helper thread, and the main thread was locking up while waiting for them. Bug 1461753 should have solved this, but we're still monitoring the situation. Please let us know if things improve. Regarding how to find the cause, the about:telemetry page should have a Slow SQL section that may point at multi-second queries, which are the most likely culprits.
Blocks: 1239708
Depends on: 1461753
Flags: needinfo?(mak77)
Please let us know whether this is still common/happening.
Flags: needinfo?(jmuizelaar)
How are you monitoring? Is there any particular telemetry query/alert we could use?
Flags: needinfo?(mak77)
We have a couple of telemetry probes that showed a spike related to this, PLACES_IDLE_FRECENCY_DECAY_TIME_MS and PLACES_IDLE_MAINTENANCE_TIME_MS. We're also monitoring reported bugs and common support forums for newly reported hangs. Unfortunately, Slow SQL telemetry has been removed from the servers; I filed bug 1367023 to get it back, since it would be extremely useful, but there has been no action there. I was hoping to also use arewesmoothyet, but it seems to be a bit behind in time (showing April?). Mike, do we have any "real-time" (even a couple of builds back) tracking of the number of hangs in Nightly? arewesmoothyet seems to be behind by 15 days or more.
Flags: needinfo?(mak77) → needinfo?(mconley)
Priority: -- → P1
Whiteboard: [fxsearch]
The telemetry probes involving frecency (see comment 6) have gone back to their usual values after the fix for bug 1461753.
Redirecting to dthayer - I'm not sure when the jobs run.
QA Contact: gwimberly
Trying to kick the arewesmoothyet job manually now. It just has issues when run automatically, even though it should be doing the exact same thing. I tried to run it manually yesterday, but I forgot to extend the lifetime of the cluster and it timed out. Anyway, just giving an update. Leaving the needinfo until it's done.
Hi Marco, Jeff, do you have any suggested STR so I can see whether I can reproduce the hang reported here?
Flags: needinfo?(mak77)
Flags: qe-verify?
This is still happening for me. It makes the Awesomebar completely unusable, and since it usually happens after the computer wakes from hibernation and needs a restart for the daily Nightly update, I also noticed that the update fails the first time and I have to force it again after starting Nightly again. So I suspect Nightly can't actually restart but hangs in shutdown without a browser window showing anymore.

I found a very slow SQL statement on helper threads from earlier today, which I probably cancelled with a browser restart after 17875 milliseconds. The statement is:

UPDATE moz_places SET frecency = CALCULATE_FRECENCY(id) WHERE frecency = -1 /* places.sqlite */

Here is a screenshot: https://abload.de/img/unbenanntxsuoo.png

I would be willing to share my profile data if that helps in any way. I'm running a pretty powerful setup with an i7 6700k, so that is probably not the cause. I can't reproduce on my other machine, a much slower laptop.
Before https://bugzilla.mozilla.org/show_bug.cgi?id=1461753 landed, I noticed these very slow statements:

6977 ms  PRAGMA integrity_check /* places.sqlite */
3535 ms  DELETE FROM moz_updateoriginsinsert_temp /* places.sqlite */
(In reply to TMart from comment #14)
> UPDATE moz_places SET frecency = CALCULATE_FRECENCY(id) WHERE frecency = -1
> /* places.sqlite */

Aha! That's almost certainly the mirror: https://searchfox.org/mozilla-central/rev/dc6d85680539cb7e5fe2843d38a30a0581bfefe1/toolkit/components/places/SyncedBookmarksMirror.jsm#4532-4535

The `BEGIN IMMEDIATE` is probably queuing up behind that.

TMart, I'm guessing you have `services.sync.engine.bookmarks.buffer` enabled? Would you mind sharing how many bookmarks you have?

Mak, is there a good way for us to chunk frecency recalculation for new URLs? Or is there an approximate frecency we can use that's "good enough" for bookmarks? Bookmarks sync before history, so there's a very good chance we'll recalculate the frecency when we sync history anyway.
Yes, I have the new Bookmark Sync enabled. Sorry, totally forgot about that. I have about 1500 bookmarks in total.
(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #16)
> Mak, is there a good way for us to chunk frecency recalculation for new
> URLs? Or, is there an approximate frecency we can use that's "good enough"
> for bookmarks? Bookmarks sync before history, so there's a very good chance
> we'll recalculate the frecency when we sync history, anyway.

In general, I think we should completely re-evaluate the way we calculate frecency, probably by having a single invalidation field that we set and then recalculating in chunks on idle.
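A minimal sketch of that chunked approach, purely illustrative: it reuses the CALCULATE_FRECENCY function and the frecency = -1 marker from the statement quoted in comment 14 as the invalidation flag, and the chunk size of 50 is an arbitrary placeholder, not a tuned value. The idea would be to run something like this repeatedly on idle until no invalidated rows remain:

-- Recalculate frecency for at most 50 invalidated pages per pass;
-- repeat on idle until the subquery matches no rows.
UPDATE moz_places
SET frecency = CALCULATE_FRECENCY(id)
WHERE id IN (SELECT id FROM moz_places WHERE frecency = -1 LIMIT 50);

Each pass would then hold the write lock on places.sqlite for a short, bounded time instead of the 17+ seconds reported in comment 14.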
(In reply to Grover Wimberly IV [:Grover-QA] from comment #12)
> Any suggested STR here to be able to see if I can reproduce the hang
> reported here?

It's tricky: it requires a profile with a large places.sqlite (70MiB) and the new Sync disabled (to avoid false positives). Just browsing and then immediately opening a bookmarks view (the menu or the toolbar overflow) could be enough, but it may depend on timing, because some of the expensive operations happen on idle or, even worse, on idle-daily, which is hard to reproduce at will. Checking Slow SQL in about:telemetry for multi-second statement executions is also a good hint. I fear we'll have to rely on telemetry and arewesmoothyet (when it gets updated), plus we're trying to get Slow SQL telemetry reports back.
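As a rough aid for that STR, here is a hypothetical check one might run with the sqlite3 CLI against a copy of the profile's places.sqlite; the frecency = -1 marker comes from the statement quoted in comment 14, and the 70MiB figure from the comment above.

PRAGMA page_size;    -- bytes per page
PRAGMA page_count;   -- number of pages; the product approximates the file size
SELECT count(*) FROM moz_places WHERE frecency = -1;  -- pages still awaiting frecency recalculation

A large database with many pending recalculations should make it more likely that the expensive idle work, and therefore the hang, can be observed.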
Flags: needinfo?(mak77)
Flags: qe-verify? → qe-verify+
Actually, this should stay ? until we discuss this more on Wednesday.
Flags: qe-verify+ → qe-verify?
This may be fixed now in bug 1461753. mak, can you verify that from telemetry results or do we need more time?
Flags: needinfo?(mak77)
The frecency telemetry I pointed at in comment 6 confirms that the computation regression disappeared (the spike is clearly visible). We didn't get new reports of this problem, and last week I spent some time on arewesmoothyet to check whether anything would jump out at me; I didn't notice any particular stacks that could be related to this. That said, verifying this problem is tricky. Our current long-term plan includes:
1. Getting back Slow SQL telemetry (work is in progress) and monitoring it.
2. Evaluating a new approach to frecency recalculation (which Kit filed as bug 1466267).
I don't think we can do more here, but if you still see this problem in Nightly, please let us know ASAP.
Assignee: nobody → mak77
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(mak77)
Flags: needinfo?(dothayer)
Resolution: --- → FIXED
Firefox Nightly on macOS has been frequently hanging for me since mid-May, particularly when my MacBook Pro wakes from hibernation. With today's Nightly build (2018-06-05), however, the hangs are much worse. STR seems to be to enable one or more extensions and then open Firefox's History menu. I tried bisecting the hangs with mozregression, which pointed to autofill bug 1239708 and led me to this hang bug. Marco, do my hangs sound like a different problem than this bug?
Flags: needinfo?(mak77)
When it happens, could you please check for Slow SQL telemetry in about:telemetry and report any queries? Hangs after hibernation are something I've heard about often in the past, so they may not be particularly new. Can you post some examples of WebExtensions causing this? Fwiw, I'm not sure bisecting will work properly here, because bug 1239708 effectively added a hang problem (now resolved) on top of the existing ones; that may have made it easier to reproduce a hang just after bug 1239708 landed, but it may still not be the same hang.
Flags: needinfo?(mak77)
Flags: needinfo?(cpeterson)
Here are the slow SQL queries logged after I reproduced the (~70 second) beach ball hang.

Slow SQL Statements on Main Thread:
1 hit, avg. time 75231 ms:
SELECT h.id, h.url, h.title AS page_title, h.rev_host, h.visit_count, h.last_visit_date, null, null, null, null, null, null AS tags , h.frecency, h.hidden, h.guid, null, null, null FROM moz_places h WHERE h.hidden = 0 AND EXISTS (SELECT id FROM moz_historyvisits WHERE place_id = h.id AND visit_type NOT IN (0,4,8) LIMIT 1) ORDER BY last_visit_date DESC LIMIT 15 /* places.sqlite */

Slow SQL Statements on Helper Threads:
2 hits, avg. time 39383 ms:
DELETE FROM moz_updateoriginsinsert_temp /* places.sqlite */

I can usually reproduce the hang when enabling one or both of these simple extensions (and then opening the History menu):
https://addons.mozilla.org/en-US/firefox/addon/always-right/
https://addons.mozilla.org/en-US/firefox/addon/in-page-pop-up-reporter/
Flags: needinfo?(cpeterson)
Thank you. Basically, the second statement took an unexpectedly long time to run (40s), and due to a legacy architectural problem we haven't solved yet (UI queries still run on the main thread), the main thread tried to run a query in the meantime and blocked on the connection mutex.

It doesn't look related to the autofill change though, so I think it should be filed as a separate bug that we'll investigate apart. Would you mind filing such a bug?

Honestly, I don't think the add-ons have any relation to this, or at least I can't see one. Is the disk particularly busy and loud (if mechanical) before these lock-ups happen? Did you enable the new Sync mirroring?
Flags: needinfo?(cpeterson)
(In reply to Marco Bonardo [::mak] from comment #26) > It doesn't look like related to the autofill change though, and as such I > think it should be filed as a separate new bug that we'll investigate apart. > Would you mind filing such a bug? I filed bug 1467537 and answered your questions there.
Flags: needinfo?(cpeterson)
Per prior discussion on this bug on Vidyo with Marco/Drew, taking off the qe-verify+ tag. If the situation on this changes, please set the flag back and NeedInfo me.
Flags: qe-verify?
Flags: needinfo?(jmuizelaar)