Closed Bug 741836 Opened 13 years ago Closed 13 years ago

Native crash without backtrace in SQLiteBridge

Categories

(Firefox for Android Graveyard :: General, defect)

ARM
Android
defect
Not set
critical

Tracking

(blocking-fennec1.0 -)

RESOLVED FIXED
Firefox 14
Tracking Status
blocking-fennec1.0 --- -

People

(Reporter: gcp, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [native-crash])

Attachments

(1 file)

Bug 736237 makes Profile Migration resumable. When testing this, I ran into an issue where the app spontaneously exits without a Java or native crash trace. Attaching jdb changes nothing. Attaching gdb makes the application work correctly. Using checkJNI makes the application work correctly. Tracing with printf debugging shows that the app exits at this statement:

https://mxr.mozilla.org/mozilla-central/source/mozglue/android/SQLiteBridge.cpp#342

This reeks of memory corruption. It would be cool if someone could valgrind this scenario and see if we have bugs.

STR: 
1) Apply the patches from bug 736237.
2) Apply the attached patch.
3) Copy a big places.sqlite into the Fennec profile. (eg. from desktop machine)
4) Start Fennec, let migration finish.
5) Quit Fennec
6) Start Fennec. Fennec will crash halfway.
blocking-fennec1.0: --- → ?
Blocks: 736237
We didn't reproduce the issue under valgrind yet, but there is something interesting in the trace:

I//data/local/start_valgrind_fennec( 1608): ==1609== Thread 11:
I//data/local/start_valgrind_fennec( 1608): ==1609== Invalid read of size 1
I//data/local/start_valgrind_fennec( 1608): ==1609==    at 0x4806620: strlen (mc_replace_strmem.c:391)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x4831AAD: strndup (strndup.c:33)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224CB0B: report_mapping (APKOpen.cpp:681)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x22251103: CustomElf::Load(Mappable*, char const*, int) (CustomElf.cpp:211)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224FDC3: ElfLoader::Load(char const*, int, LibHandle*) (ElfLoader.cpp:251)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224FE8B: __wrap_dlopen (ElfLoader.cpp:46)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224CDF3: loadSQLiteLibs(char const*) (APKOpen.cpp:810)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224CEBD: Java_org_mozilla_gecko_GeckoAppShell_loadSQLiteLibsNative (APKOpen.cpp:945)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x540CBF3: dvmPlatformInvoke (CallEABI.S:258)
I//data/local/start_valgrind_fennec( 1608): ==1609==  Address 0x4cb7082 is 0 bytes after a block of size 658 alloc'd
I//data/local/start_valgrind_fennec( 1608): ==1609==    at 0x4805440: malloc (vg_replace_malloc.c:263)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224C46D: extractBuf(char const*, Zip*) (APKOpen.cpp:648)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224CDC9: loadSQLiteLibs(char const*) (APKOpen.cpp:804)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224CEBD: Java_org_mozilla_gecko_GeckoAppShell_loadSQLiteLibsNative (APKOpen.cpp:945)
I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x540CBF3: dvmPlatformInvoke (CallEABI.S:258)
(In reply to Gian-Carlo Pascutto (:gcp) from comment #1)
> We didn't reproduce the issue under valgrind yet, but there is something
> interesting in the trace:
> 
> I//data/local/start_valgrind_fennec( 1608): ==1609== Thread 11:
> I//data/local/start_valgrind_fennec( 1608): ==1609== Invalid read of size 1
> I//data/local/start_valgrind_fennec( 1608): ==1609==    at 0x4806620: strlen
> (mc_replace_strmem.c:391)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x4831AAD:
> strndup (strndup.c:33)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224CB0B:
> report_mapping (APKOpen.cpp:681)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x22251103:
> CustomElf::Load(Mappable*, char const*, int) (CustomElf.cpp:211)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224FDC3:
> ElfLoader::Load(char const*, int, LibHandle*) (ElfLoader.cpp:251)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224FE8B:
> __wrap_dlopen (ElfLoader.cpp:46)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224CDF3:
> loadSQLiteLibs(char const*) (APKOpen.cpp:810)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224CEBD:
> Java_org_mozilla_gecko_GeckoAppShell_loadSQLiteLibsNative (APKOpen.cpp:945)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x540CBF3:
> dvmPlatformInvoke (CallEABI.S:258)
> I//data/local/start_valgrind_fennec( 1608): ==1609==  Address 0x4cb7082 is 0
> bytes after a block of size 658 alloc'd
> I//data/local/start_valgrind_fennec( 1608): ==1609==    at 0x4805440: malloc
> (vg_replace_malloc.c:263)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224C46D:
> extractBuf(char const*, Zip*) (APKOpen.cpp:648)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224CDC9:
> loadSQLiteLibs(char const*) (APKOpen.cpp:804)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x2224CEBD:
> Java_org_mozilla_gecko_GeckoAppShell_loadSQLiteLibsNative (APKOpen.cpp:945)
> I//data/local/start_valgrind_fennec( 1608): ==1609==    by 0x540CBF3:
> dvmPlatformInvoke (CallEABI.S:258)

That's bug 740318.
Julian valgrinded the exact case that's failing for me but nothing showed up.
GCP, I thought this was blocking bug 736237 from landing in which case it would be a blocker. Bug, bug 736237 landed. Can you explain?
We didn't find anything even after extensive valgrinding of the exact scenario that crashes for me (which didn't fail for sewardj, anyway), and nobody had any further idea how to progress. 

I figured that landing this would either get us some more reports and help to identify a pattern, and in the alternate case, show that this crash almost never happens.

Put differently, the alternative to landing was not doing anything.


I got an idea in the meantime to let this run and don't pass the data we get off SQLite into Android, to see if it's not some Android lib going bonkers.
I managed to crash it while stepping in the debugger:

full log: http://pastebin.mozilla.org/1560695

last part:

sqlite3VdbeExec (p=0x5c5579d8) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:65422
65422	  if( p->rc==SQLITE_NOMEM ){
(gdb) 
65427	  assert( p->rc==SQLITE_OK || p->rc==SQLITE_BUSY );
(gdb) 
65428	  p->rc = SQLITE_OK;
(gdb) 
65429	  assert( p->explain==0 );
(gdb) 
65430	  p->pResultSet = 0;
(gdb) 
65431	  db->busyHandler.nBusy = 0;
(gdb) 
65432	  CHECK_FOR_INTERRUPT;
(gdb) 
65435	  checkProgress = db->xProgress!=0;
(gdb) 

65438	  sqlite3BeginBenignMalloc();
(gdb) 

Program received signal SIGSEGV, Segmentation fault.

0xaff15c2c in ?? () from /home/morbo/jimdb/moz-gdb/lib/42800C743000157/lib/libc.so
And another one:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 15437]
0xaff15c2c in ?? () from /home/morbo/jimdb/moz-gdb/lib/42800C743000157/lib/libc.so
(gdb) bt
#0  0xaff15c2c in ?? () from /home/morbo/jimdb/moz-gdb/lib/42800C743000157/lib/libc.so
#1  0xaff1cd9c in __assert2 () from /home/morbo/jimdb/moz-gdb/lib/42800C743000157/lib/libc.so
#2  0x5dabc4ee in sqlite3MutexAlloc (id=2) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:17352
#3  0x5db46990 in enterMutex () at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:115604
#4  0x5db46b94 in sqlite3ConnectionUnlocked (db=0x5c422408) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:115725
#5  0x5daf031e in sqlite3VdbeHalt (p=0x5c4579d8) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:61797
#6  0x5daf5afe in sqlite3VdbeExec (p=0x5c4579d8) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:65712
#7  0x5daf2704 in sqlite3Step (p=0x5c4579d8) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:63161
#8  0x5daf2946 in sqlite3_step (pStmt=0x5c4579d8) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:63234
#9  0x80b2a1ca in Java_org_mozilla_gecko_sqlite_SQLiteBridge_sqliteCall (jenv=0x117c68, jDb=0x408bb1f8, jQuery=0x407b0690, 
    jParams=0x40939318, jQueryRes=0x409393e8) at /home/morbo/hg/mozilla-central/mozglue/android/SQLiteBridge.cpp:362
#10 0xaca11d38 in ?? ()
#11 0xaca11d38 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

This actually makes a little sense; the crash always happens at the last "step" inside the query, so this would seem consistent with the crash originating from sqlite3VdbeHalt.
Is this bug related to bug 743715
I needed bug 743715 to provide further evidence that https://bugzilla.mozilla.org/show_bug.cgi?id=730495#c2 is correct.
Disabling jemalloc makes the crasher go away (this isn't 100% proof that jemalloc is the cause, tho).
That would seem to suggest something is allocating memory with the system allocator, and sqlite is giving that to some jemalloc function.
Mike, one suspect I am looking at is:
http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageService.cpp#574

Which would trigger because the SQLiteBridge is called before Gecko is loaded, so it will be calling into sqlite3 without these wrappers, providing incorrect alignment.

But I'd expect that to trigger some assert in sqlite3 itself, at least that's what the bugs that caused this wrapper to be added suggest. So I'm not 100% sure.
Bug 722884 has some similarities to this one. Unsure if related.
blocking-fennec1.0: ? → -
Severity: normal → critical
Keywords: crash
Whiteboard: [native-crash]
(In reply to Gian-Carlo Pascutto (:gcp) from comment #10)
> Disabling jemalloc makes the crasher go away

That might explain why the various valgrind runs found nothing -- I was
using a non-jemalloc build.
I put some logging into SQLite itself:

http://pastebin.mozilla.org/1565953

The sqlite3MutexEnd call is probably what kills us. This is called from sqlite3_shutdown. Richard informed me yesterday that our calling of this function (which was in his opinion unnecessary) could be a cause. Looks like he's right!

Theory: the failed sqlite3_config call causes the storage service to shut down and sqlite3_shutdown to be called.
This is probably fixed by bug 721784.
Status: NEW → RESOLVED
Closed: 13 years ago
Depends on: 721784
Resolution: --- → FIXED
Target Milestone: --- → Firefox 14
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: