Closed Bug 1309696 Opened 8 years ago Closed 8 years ago

Crash on Mac OS X [@ je_malloc | libsystem_malloc.dylib@0x...] from infinite recursion of sqlite3SelectDup

Categories

(Toolkit :: Storage, defect)

52 Branch
Unspecified
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
firefox52 --- fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: esthermonchari, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash)

Crash Data

This bug was filed from the Socorro interface and is 
report bp-10a90335-a476-4fb3-a15d-597942160929.
=============================================================

Crash involves multiple users,Started with build id 20160929030426 and is still continuing.

adding ni on:Carsten,could this be related with Bug 1011839 ?

adding ni on:Stuart, any ideas on what could be causing the crash?
Flags: needinfo?(pavlov)
Flags: needinfo?(cbook)
Hi Esther, not sure if this is the same bug. Some Steps to reproduce like urls etc would useful i guess
Flags: needinfo?(cbook)
Flags: needinfo?(pavlov)
Crashes seem to be on 10.12 only: http://bit.ly/2e16ZYt. After reviewing the data, this seems to the same user crashing repeatedly.
 (In reply to Esther Monchari from comment #0)
> This bug was filed from the Socorro interface and is 
> report bp-10a90335-a476-4fb3-a15d-597942160929.
> =============================================================
> 
> Crash involves multiple users,Started with build id 20160929030426 and is
> still continuing.

Looking at your crash history, I see this dates to well prior to the above build.
See bp-dbfe9b53-72ff-4db7-bcdc-b54312160815 which is Build ID 20160815030201

At least 1/3 in first 20 seconds, and another ~1/4 in the first 120 seconds
And in the past few days you crash on average several times per day [1]. Ditto for early for early November. 
But in the period between 2016-11-11 2016-12-1 you only have 2 crashes.

So what happened in mid to late Noevember that might have led to so few crashes?
Can you provide steps?
Does starting in safe mode tend to reduce the startup crashes?

[1] 
bp-2109115f-3a6e-44a6-ba68-aed902161202	2016-12-02 17:36:52 	arena_avail_tree_remove | arena_run_split | arena_run_alloc | arena_malloc | je_malloc | libsystem_malloc.dylib@0x22d9   
bp-4804a2ba-407d-4687-a836-cbfc82161202	2016-12-02 17:26:31 	arena_malloc | je_malloc | libsystem_malloc.dylib@0x22d9  
bp-c362b50a-ed06-4ed3-bf4f-26b892161202	2016-12-02 01:48:12 	arena_avail_tree_remove | arena_run_split | arena_run_alloc | arena_malloc | je_malloc | libsystem_malloc.dylib@0x22d9   
bp-0aa097d7-d3bf-49d1-a3e9-27c1f2161202	2016-12-02 01:44:06 	arena_malloc | je_malloc | libsystem_malloc.dylib@0x22d9  
bp-88c0da7e-831d-445f-a535-b98ed2161202	2016-12-02 01:32:22 	arena_malloc | je_malloc | libsystem_malloc.dylib@0x22d9   
bp-a0fe4285-88f4-4c40-b047-e47322161202	2016-12-02 01:32:01 	arena_malloc | je_malloc | libsystem_malloc.dylib@0x22d9   
bp-d057cd11-5e66-4dcc-a4e4-f48b72161202	2016-12-02 01:31:38 	EMPTY: no crashing thread identified; ERROR_NO_THREAD_LIST
Severity: normal → critical
Flags: needinfo?(esthermonchari)
The crash stack in bp-10a90335-a476-4fb3-a15d-597942160929 is interesting:
Frame	Module	Signature	Source
0 	libmozglue.dylib	je_malloc	memory/mozjemalloc/jemalloc.c:4323
Ø 1 	libsystem_malloc.dylib	libsystem_malloc.dylib@0x2331	
Ø 2 	libsystem_malloc.dylib	libsystem_malloc.dylib@0x12af	
3 	libnss3.dylib	sqlite3Malloc	db/sqlite3/src/sqlite3.c:24067
4 	libnss3.dylib	dbMallocRawFinish	db/sqlite3/src/sqlite3.c:24424
5 	libnss3.dylib	exprDup	db/sqlite3/src/sqlite3.c:24492
6 	libnss3.dylib	sqlite3ExprListDup	db/sqlite3/src/sqlite3.c:90468
7 	libnss3.dylib	sqlite3SelectDup	db/sqlite3/src/sqlite3.c:90578
8 	libnss3.dylib	sqlite3SelectDup	db/sqlite3/src/sqlite3.c:90585
9 	libnss3.dylib	sqlite3SelectDup	db/sqlite3/src/sqlite3.c:90585 
...
1022 	libnss3.dylib	sqlite3SelectDup	db/sqlite3/src/sqlite3.c:90585
1023 	libnss3.dylib	sqlite3SelectDup	db/sqlite3/src/sqlite3.c:90585
1024 	libnss3.dylib	sqlite3SelectDup	db/sqlite3/src/sqlite3.c:90585 

Given that sqlite3SelectDup actually is recursive, it seems like this might be
a bug in sqlite3:

(Note that this file is huge and takes a while to load)
https://hg.mozilla.org/mozilla-central/annotate/f7d5008ee2ab/db/sqlite3/src/sqlite3.c#l90585
Crash Signature: [@ je_malloc | libsystem_malloc.dylib@0x2331 ] → [@ je_malloc | libsystem_malloc.dylib@0x2331 ] [@ arena_malloc | je_malloc | libsystem_malloc.dylib@0x22d9] [@ je_malloc | libsystem_malloc.dylib@0x22d9 ] [@ arena_avail_tree_remove | arena_run_split | arena_run_alloc | arena_malloc | je_malloc | libsy…
Summary: Crash on Mac OS X [@ je_malloc | libsystem_malloc.dylib@0x2331] → Crash on Mac OS X [@ je_malloc | libsystem_malloc.dylib@0x...] from infinite recursion of sqlite3SelectDup
I withdraw my questions of comment 1 - Esther obviously isn't the person crashing. Plus, Stephen's crashes haven't stopped.
Flags: needinfo?(esthermonchari)
I see a similar issue at another address. For instance:
  bp-3e7044bc-b1dd-4357-8619-a0ddb2170201
Crash Signature: libsystem_malloc.dylib@0x22d9 ] [@ arena_run_alloc | arena_malloc | je_malloc | libsystem_malloc.dylib@0x22d9 ] → libsystem_malloc.dylib@0x22d9 ] [@ arena_run_alloc | arena_malloc | je_malloc | libsystem_malloc.dylib@0x22d9 ] [@ arena_malloc | je_malloc | libsystem_malloc.dylib@0x22c1 ]
I guess I'll move it over to Toolkit::Storage for now.
Component: General → Storage
Product: Core → Toolkit
(In reply to Mats Palmgren (:mats) from comment #4)
> Given that sqlite3SelectDup actually is recursive, it seems like this might
> be a bug in sqlite3:

I wonder if Richard can gather anything useful out of these stacks (comment 4 and comment 7).
Flags: needinfo?(drh)
Dan was able to reproduce the stack trace by running an SQL statement like this:

CREATE VIEW v1 AS VALUES(0,0,0),(0,0,0),<Repeat 130,000 times>,(0,0,0);

It appears to be overflowing the stack.  Smaller VALUES clauses (less than about 131,000 terms) work.  We will be working on a fix.

Does Firefox ever generate a VIEW like that?

Another construct that invokes sqlite3SelectDup() on a massive VALUES clause would be an INSERT statement inside of a TRIGGER.  Does Firefox ever generate the SQL for triggers at run-time?

One work-around to this problem is to rebuild with -DSQLITE_MAX_SQL_LENGTH=1000000 or some other reasonable limit on the length of an SQL statement. (The default limit is 1GB.) The CREATE VIEW above that crashes is about 5MB in size, so setting an SQL size limit of 1MB seems reasonable.
A patch has been checked into the SQLite trunk that avoids the deep recursion when processing a VALUES clause with hundreds of thousands of rows.  We can backport this patch to version 3.14.1, or any other version you need, if desired.  Otherwise, the patch will appear in the 3.17.0 release, probably sometime in February.
(In reply to D. Richard Hipp from comment #10)
> Does Firefox ever generate a VIEW like that?

We don't seem to use VIEWs, I just did a plain code search for "CREATE.*VIEW".
Though, I can't exclude add-ons.

> Another construct that invokes sqlite3SelectDup() on a massive VALUES clause
> would be an INSERT statement inside of a TRIGGER.  Does Firefox ever
> generate the SQL for triggers at run-time?

I couldn't find any in our codebase. All the triggers that do an insert are just inserting a single entry.
Again, I can't exclude add-ons.

The strange thing is that this is Mac 10.12.1 only, if this would be something in the queries I'd expect it to hit on all the platforms. Off-hand I can't think of Mac-only code that uses Sqlite, I'll keep digging.

> One work-around to this problem is to rebuild with
> -DSQLITE_MAX_SQL_LENGTH=1000000 or some other reasonable limit on the length
> of an SQL statement. (The default limit is 1GB.) The CREATE VIEW above that
> crashes is about 5MB in size, so setting an SQL size limit of 1MB seems
> reasonable.

We could evaluate that as it seems a good idea in general, but we clearly couldn't apply that to the system sqlite on Linux. Also, unfortunately we still have some queries that can generate large IN clauses, we should first check we're not hitting a limit there, and that's also a perf problem (we should use TEMP tables instead).
(In reply to D. Richard Hipp from comment #11)
> We can backport this patch to version 3.14.1, or any other version
> you need, if desired.  Otherwise, the patch will appear in the 3.17.0
> release, probably sometime in February.

The crash rate is quite low, I think it will be fine to wait for the next release, I just wanted to be sure you were informed about this stack.
Add-on correlations in Nightly are:
BrowserStack
Desktop messenger for WhatsAppβ„’
FoxClocks
Tamper Data
FoxyProxy Standard
@activity-streams
LastPass Password Manager
wayback_machine@mozilla.org

I'm sure some of these use Sqlite. Just in case I'm ni Nan to ensure Activity Stream doesn't build large VALUES() clauses in triggers/views, since we can check that easily.
Flags: needinfo?(najiang)
(In reply to Marco Bonardo [::mak] from comment #12)
> > One work-around to this problem is to rebuild with
> > -DSQLITE_MAX_SQL_LENGTH=1000000 or some other reasonable limit on the length
> > of an SQL statement. (The default limit is 1GB.) The CREATE VIEW above that
> > crashes is about 5MB in size, so setting an SQL size limit of 1MB seems
> > reasonable.
> 
> We could evaluate that as it seems a good idea in general, but we clearly
> couldn't apply that to the system sqlite on Linux. Also, unfortunately we
> still have some queries that can generate large IN clauses, we should first
> check we're not hitting a limit there, and that's also a perf problem (we
> should use TEMP tables instead).

If you want, the SQL length limit can also be set at run-time for each database connection using sqlite3_limit(db, SQLITE_LIMIT_SQL_LENGTH, 250000).

Rather than generating larger IN clauses, could you use the "carray" extension: https://www.sqlite.org/carray.html
Hi Marco,

So Activity Stream doesn't use triggers/views at all. Also, it uses SQLITE_MAX_VARIABLE_NUMBER (default as 999) to cap the # of host parameters. However, it doesn't check the length of SQL queries.

Hope this helps.
Flags: needinfo?(najiang)
(In reply to D. Richard Hipp from comment #15)
> If you want, the SQL length limit can also be set at run-time for each
> database connection using sqlite3_limit(db, SQLITE_LIMIT_SQL_LENGTH, 250000).

we could do that, carefully. Will file a bug.

> Rather than generating larger IN clauses, could you use the "carray"
> extension: https://www.sqlite.org/carray.html

Yes, it's something we want to implement, but we don't have resources to do that (bug 483318). One of the problems we have to solve is that our Sqlite.jsm API currently tries to bind arrays as blobs. We may have to change blobs to use a typedArray to distinguish the 2 cases.
Will file a bug.
Given that 52 is our next ESR, I think it would be good to get some sort of in-product fix that we can uplift. I'm not sure if that means we take a targeted SQLite fix or an in-tree fix, though?
Flags: needinfo?(mak77)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #18)
> Given that 52 is our next ESR, I think it would be good to get some sort of
> in-product fix that we can uplift. I'm not sure if that means we take a
> targeted SQLite fix or an in-tree fix, though?

Not worth the complexity based on the number of reports, we'll take a fix when it's ready, unless the crash ratio increases to meaningful numbers.
Flags: needinfo?(mak77)
I get this reproducibly, a couple of minutes after start up, even without add-ons.
 Just turn on  sync. Started on FF51. I get the same on 52 and 54. Several thousand bookmarks.

Going back to 49 things are stable again

eg for 54
e2791c60-c8e6-46bf-b4e7-276cd2170209
Glad you said that @AM because I tried to not sync bookmarks and it seems to be stable.

Indeed, I've got lots of bookmarks too in "Unsorted bookmarks".
Thank you, that's useful information, at least we have a possible path to investigate. Sync recently changed some code and queries, I wonder if due to those changes we may end up building a very large query.
Kit, we should check all the Sync changes that went into 51 and try to identify the query causing this with many thousands bookmarks in a single folder. It's likely once we take Sqlite 3.17.0 the query will fail, and the sync will be interrupted.
At this point possible culprits could be bug 1274108, bug 1293365

(In reply to AM from comment #21)
>  Just turn on  sync. Started on FF51. I get the same on 52 and 54. Several
> thousand bookmarks.
> 
> Going back to 49 things are stable again

Why not 50? Does it crash or not?
Can you tell us how many thousands bookmarks you have in the Unsorted folder?

Richard, can this also happen for WITH queries? Something like this:
WITH sorting(g, p) AS (
 VALUES ${valuesTable}
)
UPDATE moz_bookmarks SET position = (
 SELECT CASE count(*) WHEN 0 THEN -position
                             ELSE count(*) - 1
         END
  FROM sorting a
  JOIN sorting b ON b.p <= a.p
  WHERE a.g = guid
 )
 WHERE parent = :parentId

where ${valuesTable} is a very large list.
Flags: needinfo?(kit)
Flags: needinfo?(augustmiles)
Depends on: SQLite3.17.0
Depends on: 1339390
It's very likely this affects a small number of users that have a lot of bookmarks into unsorted, that would explain why we see repeated reports from the same users, and on a single platform.

A possible workaround to upgrade to a newer version may be to move some of those bookmarks into subfolders.
I've got 7000 bookmarks in "other bookmarks". :) And to be fair, "other bookmarks" is for me my "read later" folder.
First I thought it was because of lack of RAM on my macbookair (4gb) when it didn't crash on my windows computer (6gb) but now i've got a 16gb macbookpro and it still crashes. So I think it's a macOS only bug to be fair.
Sync switched to using `PlacesUtils.bookmarks.reorder` in bug 1274108. Assuming the CTE in `reorder` is the problem, that lines up with comment 23.
(In reply to Marco Bonardo [::mak] from comment #23)
> Richard, can this also happen for WITH queries? Something like this:
> WITH sorting(g, p) AS (
>  VALUES ${valuesTable}
> )
> UPDATE ...
> where ${valuesTable} is a very large list.

Yes.  Any VALUES clause with a large number of rows (130,000 or more) would use excessive stack space in 3.16.2 and earlier.  In 3.17.0, it should work and get the correct answer without deep recursion.
(In reply to Erwann Mest from comment #25)
> I've got 7000 bookmarks in "other bookmarks". :) And to be fair, "other
> bookmarks" is for me my "read later" folder.

I have ~6900 (so, very close to the same number, obviously), and have been crashing (though it's been stable lately).
I have 12,000 bookmarks. 4/day for ten years. I tend to annotate in detail using tags -- and work by keywords.  Almost all are professional links. It seems a reasonable use of a tagging system. I work on a big campus, so I tend to have a fair number of machines (5 or 6) that synchronize depending on where I am working.

I went back to 49 because 50 was already using lots of cpu and eating battery on my macbook.
"The strange thing is that this is Mac 10.12.1 only", no, it also happens on the previous macOS.
Flags: needinfo?(augustmiles)
All development versions have been updated to Sqlite 3.17.0.
Thus the crash should not happen anymore on Nightly, and in the next DevEdition and Beta releases (not sure when they will be released exactly, likely in a week).
If you could reproduce easily, please try Nightly, or wait for next dev/beta releases, and let us know.

We are still investigating ways to reduce the number of times we need to reorder large list of bookmarks during a Sync in bug 1339390, but there isn't much more we can do here.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(kit)
Flags: needinfo?(drh)
Resolution: --- → FIXED
Firefox 52 beta 7, released today, should contain the fix for this. Or a recently Nightly/DevEdition build will have it. Would be great if anybody in this bug who could reproduce the problem would be able to verify that things work now :)
It seems to be fixed. I don't have any crash for some days. :) Thank you all.
You need to log in before you can comment on or make changes to this bug.