Closed Bug 828149 · Opened 12 years ago · Closed 12 years ago
Firefox Health Report causes crash during application shutdown [on Talos Linux64 opt]
Categories
(Firefox Health Report Graveyard :: Client: Desktop, defect)
Tracking
(firefox19 unaffected, firefox20+ fixed, firefox21+ fixed)
Status: RESOLVED FIXED
Target Milestone: Firefox 21
People
(Reporter: gps, Assigned: gps)
Attachments
(1 file, 1 obsolete file)
(deleted), patch | rnewman: review+ | lsblakk: approval-mozilla-aurora+ | Details | Diff | Splinter Review
We're seeing weird hangs and crashes on Talos with Firefox Health Report. We appear to encounter this 100% of the time with Linux64 opt builds. Other builders seem to encounter Talos red more frequently with FHR enabled.
From https://tbpl.mozilla.org/?tree=Try&rev=9f41fae3cc07:
NOISE: 1357666153772 Services.HealthReport.HealthReporter INFO Initializing health reporter instance against healthreport.
NOISE: 1357666153864 Sqlite.ConnectionOpener INFO Opening database: /tmp/tmpy-6Gx6/profile/healthreport.sqlite (healthreport.sqlite#1)
NOISE: 1357666153866 Sqlite.Connection.healthreport.sqlite INFO Conn #1: Opened
NOISE: 1357666153866 Sqlite.Connection.healthreport.sqlite DEBUG Conn #1: Beginning transaction
NOISE: 1357666153866 Services.Metrics.MetricsStorage DEBUG Database schema up to date.
NOISE: 1357666153866 Sqlite.Connection.healthreport.sqlite DEBUG Conn #1: Transaction committed.
NOISE: 1357666153867 Sqlite.Connection.healthreport.sqlite TRACE Conn #1: Stmt #0 SELECT * FROM types
NOISE: 1357666153871 Sqlite.Connection.healthreport.sqlite DEBUG Conn #1: Stmt #0 finished
NOISE: 1357666153872 Sqlite.Connection.healthreport.sqlite TRACE Conn #1: Stmt #1 SELECT * FROM v_measurements
NOISE: 1357666153874 Sqlite.Connection.healthreport.sqlite DEBUG Conn #1: Stmt #1 finished
NOISE: 1357666153874 Sqlite.Connection.healthreport.sqlite TRACE Conn #1: Stmt #2 SELECT * FROM v_fields
NOISE: 1357666153877 Sqlite.Connection.healthreport.sqlite DEBUG Conn #1: Stmt #2 finished
NOISE: 1357666153877 Services.HealthReport.HealthReporter INFO Storage initialized.
NOISE: 1357666153877 Services.HealthReport.HealthReporter INFO Initializing collector.
NOISE: 1357666153877 Services.HealthReport.HealthReporter INFO Registering providers from category: healthreport-js-provider
NOISE: 1357666153877 Services.HealthReport.HealthReporter INFO Attempting to load provider from category manager: AddonsProvider from resource://gre/modules/services/healthreport/providers.jsm
NOISE: 1357666153879 Services.Metrics.Collector INFO Initializing provider with storage: org.mozilla.addons
NOISE: 1357666153880 Services.Metrics.MetricsStorage TRACE Enqueueing operation.
NOISE: 1357666153880 Services.Metrics.MetricsStorage TRACE Performing queued operation.
NOISE: 1357666153880 Sqlite.Connection.healthreport.sqlite TRACE Conn #1: Stmt #3 INSERT INTO providers (name) VALUES (:provider) - {"provider":"org.mozilla.addons"}
NOISE: 1357666153880 Services.HealthReport.HealthReporter INFO Attempting to load provider from category manager: AppInfoProvider from resource://gre/modules/services/healthreport/providers.jsm
NOISE: 1357666153881 Services.HealthReport.HealthReporter INFO Attempting to load provider from category manager: CrashesProvider from resource://gre/modules/services/healthreport/providers.jsm
NOISE: 1357666153881 Services.HealthReport.HealthReporter INFO Attempting to load provider from category manager: ProfileMetadataProvider from resource://gre/modules/services/healthreport/profile.jsm
NOISE: 1357666153884 Services.HealthReport.HealthReporter INFO Attempting to load provider from category manager: SessionsProvider from resource://gre/modules/services/healthreport/providers.jsm
NOISE: 1357666153885 Services.HealthReport.HealthReporter INFO Attempting to load provider from category manager: SysInfoProvider from resource://gre/modules/services/healthreport/providers.jsm
NOISE: 1357666153997 Sqlite.Connection.healthreport.sqlite DEBUG Conn #1: Stmt #3 finished
NOISE: 1357666153997 Sqlite.Connection.healthreport.sqlite TRACE Conn #1: Stmt #4 SELECT id FROM providers WHERE name = :provider - {"provider":"org.mozilla.addons"}
NOISE: 1357666153998 Services.Metrics.MetricsStorage WARN Failure when performing queued operation: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [mozIStorageAsyncStatement.executeAsync] Stack trace: resource://gre/modules/Sqlite.jsm:600 < resource://gre/modules/Sqlite.jsm:373 < createMeasurement()@resource://gre/modules/services/metrics/storage.jsm:991 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < resolveDeferred()@resource://gre/modules/commonjs/promise/core.js:125 < then()@resource://gre/modules/commonjs/promise/core.js:34 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < resource://gre/modules/Sqlite.jsm:608 < <file:unknown>
NOISE: 1357666153999 Services.Metrics.Collector WARN Provider initialization failed: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [mozIStorageAsyncStatement.executeAsync] Stack trace: resource://gre/modules/Sqlite.jsm:600 < resource://gre/modules/Sqlite.jsm:373 < createMeasurement()@resource://gre/modules/services/metrics/storage.jsm:991 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < resolveDeferred()@resource://gre/modules/commonjs/promise/core.js:125 < then()@resource://gre/modules/commonjs/promise/core.js:34 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < resource://gre/modules/Sqlite.jsm:608 < <file:unknown>
NOISE: 1357666153999 Services.HealthReport.HealthReporter ERROR Error during initialization: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [mozIStorageAsyncStatement.executeAsync] Stack trace: resource://gre/modules/Sqlite.jsm:600 < resource://gre/modules/Sqlite.jsm:373 < createMeasurement()@resource://gre/modules/services/metrics/storage.jsm:991 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < resolveDeferred()@resource://gre/modules/commonjs/promise/core.js:125 < then()@resource://gre/modules/commonjs/promise/core.js:34 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < resource://gre/modules/Sqlite.jsm:608 < <file:unknown>
NOISE: 1357666153999 Services.HealthReport.HealthReporter INFO Request to shut down.
NOISE: 1357666153999 Services.HealthReport.HealthReporter WARN Collector is in progress of initializing. Waiting to finish.
NOISE: 1357666153999 Services.Metrics.Collector INFO Initializing provider with storage: org.mozilla.appInfo
NOISE: 1357666154000 Services.Metrics.MetricsStorage TRACE Enqueueing operation.
NOISE: 1357666154000 Services.Metrics.MetricsStorage TRACE Performing queued operation.
NOISE: 1357666154000 Sqlite.Connection.healthreport.sqlite TRACE Conn #1: Stmt #5 INSERT INTO providers (name) VALUES (:provider) - {"provider":"org.mozilla.appInfo"}
NOISE: 1357666154002 Services.Metrics.MetricsStorage WARN Failure when performing queued operation: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [mozIStorageAsyncStatement.executeAsync] Stack trace: resource://gre/modules/Sqlite.jsm:600 < resource://gre/modules/Sqlite.jsm:373 < createMeasurement()@resource://gre/modules/services/metrics/storage.jsm:989 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < TaskImpl()@resource://gre/modules/Task.jsm:163 < Task_spawn()@resource://gre/modules/Task.jsm:135 < createMeasurementOperation()@resource://gre/modules/services/metrics/storage.jsm:985 < resource://gre/modules/services/metrics/storage.jsm:1277 < resource://gre/modules/services/metrics/storage.jsm:1260 < resource://gre/modules/services/metrics/storage.jsm:984 < init()@resource://gre/modules/services/metrics/dataprovider.jsm:429 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < TaskImpl()@resource://gre/modules/Task.jsm:163 < Task_spawn()@resource://gre/modules/Task.jsm:135 < resource://gre/modules/services/metrics/dataprovider.jsm:421 < resource://gre/modules/services/metrics/collector.jsm:92 < onError()@resource://gre/modules/services/metrics/collector.jsm:122 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < TaskImpl_run()@resource://gre/modules/Task.jsm:223 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < onError()@resource://gre/modules/services/metrics/storage.jsm:1308 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < TaskImpl_run()@resource://gre/modules/Task.jsm:223 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < resolveDeferred()@resource://gre/modules/commonjs/promise/core.js:125 < then()@resource://gre/modules/commonjs/promise/core.js:34 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < resource://gre/modules/Sqlite.jsm:608 < <file:unknown>
NOISE: 1357666154003 Services.Metrics.Collector WARN Provider initialization failed: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [mozIStorageAsyncStatement.executeAsync] Stack trace: resource://gre/modules/Sqlite.jsm:600 < resource://gre/modules/Sqlite.jsm:373 < createMeasurement()@resource://gre/modules/services/metrics/storage.jsm:989 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < TaskImpl()@resource://gre/modules/Task.jsm:163 < Task_spawn()@resource://gre/modules/Task.jsm:135 < createMeasurementOperation()@resource://gre/modules/services/metrics/storage.jsm:985 < resource://gre/modules/services/metrics/storage.jsm:1277 < resource://gre/modules/services/metrics/storage.jsm:1260 < resource://gre/modules/services/metrics/storage.jsm:984 < init()@resource://gre/modules/services/metrics/dataprovider.jsm:429 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < TaskImpl()@resource://gre/modules/Task.jsm:163 < Task_spawn()@resource://gre/modules/Task.jsm:135 < resource://gre/modules/services/metrics/dataprovider.jsm:421 < resource://gre/modules/services/metrics/collector.jsm:92 < onError()@resource://gre/modules/services/metrics/collector.jsm:122 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < TaskImpl_run()@resource://gre/modules/Task.jsm:223 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < onError()@resource://gre/modules/services/metrics/storage.jsm:1308 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < TaskImpl_run()@resource://gre/modules/Task.jsm:223 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < resolveDeferred()@resource://gre/modules/commonjs/promise/core.js:125 < then()@resource://gre/modules/commonjs/promise/core.js:34 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < resource://gre/modules/Sqlite.jsm:608 < <file:unknown>
The order of operations is most weird. We're seeing an INSERT performed. Immediately afterward, a SELECT fails with NS_ERROR_UNEXPECTED. Immediately thereafter, we see a log entry for shutdown initiation.
It certainly appears that shutdown is triggering some kind of assertion. However, we've taken steps to ensure that the FHR shutdown and database termination logic is sound. We have unit test coverage injecting shutdown requests at various phases of the FHR service lifecycle, and everything appears to work. Furthermore, logs from other Talos runners reveal that shutdown is handled properly on other builders.
I've talked with Marco for the Storage perspective and Jeff Hammel for the Talos perspective and nothing obvious sticks out so far. Talos seems to initiate shutdown the same way as Firefox and it isn't employing any aggressive process killing that would cause database corruption. The current theories are:
1) The database is getting corrupted somehow. Possible check: push a try build with "PRAGMA integrity_check" on connection open and see if corruption is detected.
2) We're running into a Storage bug (try to repro with a debug build and get storage logs)
I'll be actively chasing down these and other minor theories.
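For theory 1, the probe could be as small as the following. This is a minimal sketch against the promise-based Sqlite.jsm API visible in the logs above; the function name and where it would be wired into FHR's storage initialization are my assumptions, not code from the tree:

Components.utils.import("resource://gre/modules/Sqlite.jsm");
Components.utils.import("resource://gre/modules/Task.jsm");

// Hypothetical probe for theory 1: look for corruption as soon as we open
// healthreport.sqlite, before any other statements run.
function checkIntegrity(path) {
  return Task.spawn(function* () {
    let conn = yield Sqlite.openConnection({path: path});
    try {
      // PRAGMA integrity_check returns a single row containing "ok" when
      // the database is healthy; anything else describes the corruption.
      let rows = yield conn.execute("PRAGMA integrity_check");
      let result = rows[0].getResultByIndex(0);
      if (result != "ok") {
        Components.utils.reportError("healthreport.sqlite corrupt: " + result);
      }
    } finally {
      yield conn.close();
    }
  });
}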
There is a potential that this will introduce a general crasher if landed in its current state. I've talked to Alex Keybl about this, and he has blessed the idea of landing FHR on m-c enabled everywhere except Linux64. If we see crashes on other platforms in the wild, we'll assess the severity and back out if needed.
Comment 1•12 years ago
(In reply to Gregory Szorc [:gps] from comment #0)
> The order of operations is most weird. We're seeing an INSERT performed.
> Immediately afterward, a SELECT fails with NS_ERROR_UNEXPECTED. Immediately
> thereafter, we see a log entry for shutdown initiation.
It sounds like your read might be failing with SQLITE_BUSY (which gets converted to NS_ERROR_UNEXPECTED). With the default rollback journal mode, your reads will fail if a write operation is occurring simultaneously:
http://www.sqlite.org/draft/lockingv3.html
Assignee
Comment 2•12 years ago
Interesting. We should only ever have 1 process consuming the db file and only 1 connection per process to the database. So, there should be no contention on the database. My reading of that document is that if this holds, we should have no trouble with SQLITE_BUSY unless there is a hot journal.
Marco: I see from section 4.1 of the linked document that you are supposed to take special actions when dealing with hot journals. Is this something Storage does for us or must all SQLite consumers implement this functionality?
Also, I'm wondering if there are some PRAGMA commands we should issue upon opening the database. Currently we use none (unless Storage is issuing some automatically).
Flags: needinfo?(mak77)
Comment 3•12 years ago
(In reply to Gregory Szorc [:gps] from comment #2)
> Interesting. We should only ever have 1 process consuming the db file and
> only 1 connection per process to the database. So, there should be no
> contention on the database. My reading of that document is that if this
> holds, we should have no trouble with SQLITE_BUSY unless there is a hot
> journal.
Didn't we discuss this afternoon what might happen if the DB is replaying the journal as you open it?
Comment 4•12 years ago
(In reply to Gregory Szorc [:gps] from comment #2)
> Interesting. We should only ever have 1 process consuming the db file and
> only 1 connection per process to the database. So, there should be no
> contention on the database. My reading of that document is that if this
> holds, we should have no trouble with SQLITE_BUSY unless there is a hot
> journal.
That's right: if you only have 1 thread in 1 process accessing the database, there should be no problem with busy, since everything is just serialized.
> Marco: I see from section 4.1 of the linked document that you are supposed
> to take special actions when dealing with hot journals. Is this something
> Storage does for us or must all SQLite consumers implement this
> functionality?
AFAIK the only thing Storage does is use a busy callback on the async thread to wait for the main thread. No special treatment for hot journals is provided; though, as the documentation states, hot journals are a special case, and I don't think we are hitting that (and even if we were, I'd not expect issues with a single connection).
> Also, I'm wondering if there are some PRAGMA commands we should issue upon
> opening the database. Currently we use none (unless Storage is issuing some
> automatically).
You may do some experiments, and maybe these could give us a hint:
1. change journal mode: PRAGMA journal_mode = truncate
2. if you use temp tables: PRAGMA temp_store = MEMORY
3. change synchronous: PRAGMA synchronous = FULL
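For example, these could be issued right after the connection opens, before any other statements run (a sketch only; the helper name and its placement are illustrative, not from FHR's actual storage code):

Components.utils.import("resource://gre/modules/Sqlite.jsm");
Components.utils.import("resource://gre/modules/Task.jsm");

// Hypothetical helper: apply the experimental PRAGMAs to a freshly-opened
// Sqlite.jsm connection before FHR issues any other statements.
function applyExperiments(conn) {
  return Task.spawn(function* () {
    yield conn.execute("PRAGMA journal_mode = truncate");
    yield conn.execute("PRAGMA temp_store = MEMORY");
    yield conn.execute("PRAGMA synchronous = FULL");
  });
}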
Did you try handling that problem by cancelling all statements on shutdown? I wonder if there's a relation to what you see.
Flags: needinfo?(mak77)
Assignee
Comment 5•12 years ago
(In reply to Marco Bonardo [:mak] from comment #4)
> did you try to handle that problem with cancelling all statements on
> shutdown? I wonder if there's a relation with what you see.
This cannot be the issue, because we only cancel statements in the Sqlite.jsm .close() routine, and we aren't seeing the log message that would indicate that method is anywhere close to being called.
Updated•12 years ago
Assignee: nobody → gps
status-firefox19: --- → unaffected
tracking-firefox20: --- → +
tracking-firefox21: --- → +
Assignee
Comment 6•12 years ago
Ehsan provided me a log that clearly indicated FHR being initialized *after* quit-application was notified. He also confirmed that nsITimer instances aren't cancelled as part of shutdown. I'm reasonably confident we've found the source of our shutdown crashes.
Attachment #702078 - Flags: review?(rnewman)
Attachment #702078 - Flags: feedback?(ehsan)
Comment 7•12 years ago
Comment on attachment 702078 [details] [diff] [review]
Cancel initialization timer on shutdown, v1
rs=me. I tested a build with this patch and didn't observe the crash, and I promised gps to keep an eye on this once the patch lands, in case the aborts show up again.
Attachment #702078 - Flags: feedback?(ehsan) → feedback+
Comment 8•12 years ago
Comment on attachment 702078 [details] [diff] [review]
Cancel initialization timer on shutdown, v1
>+ // Shutdown doesn't clear pending timers. So, we need to explicitly
>+ // cancel our health reporter initialization timer or else it will
>+ // attempt initialization after shutdown has commenced. This would
>+ // likely lead to stalls or crashes.
>+ if (this.timer) {
>+ this.timer.cancel();
>+ }
>+
Is this racy? I.e., could the timer fire and enqueue the callback (the initialization function) before we call timer.cancel()? If so, we could still sometimes get these crashes.
Comment 9•12 years ago
Comment on attachment 702078 [details] [diff] [review]
Cancel initialization timer on shutdown, v1
Review of attachment 702078 [details] [diff] [review]:
-----------------------------------------------------------------
You should make a note somewhere that we're quitting, and check for it in `get healthReporter()`. If the timer has fired but its callback hasn't run, we'll still die. (This is a narrower race.)
I would also suggest that HealthReporter install a quit listener in its constructor, and do a similar thing in init. Its initialization is asynchronous, so we can hit the point of fetching an instance on timer (constructor finished), but have the quit occur prior to execution of all the delayed init work.
I'll accept a proof that this is impossible instead.
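Something along these lines is what I have in mind (illustrative only; the `_shuttingDown` flag, the object shape, and the HealthReporter constructor arguments are assumptions, not prescriptions):

// Illustrative guard in the lazy getter; names are placeholders, and the
// HealthReporter import and ctor arguments are assumed.
let FHRService = {
  _shuttingDown: false,
  _reporter: null,

  get healthReporter() {
    // If quit has been observed, refuse to create the instance even when
    // the timer's callback was already enqueued before cancellation.
    if (this._shuttingDown) {
      return null;
    }
    if (!this._reporter) {
      this._reporter = new HealthReporter("healthreport.");
    }
    return this._reporter;
  },
};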
Attachment #702078 - Flags: review?(rnewman) → feedback+
Assignee
Comment 10•12 years ago
(In reply to Vladan Djeric (:vladan) from comment #8)
> Is this racy? I.e., could the timer fire and enqueue the callback
> (the initialization function) before we call timer.cancel()? If so, we
> could still sometimes get these crashes.
I'm not sure! While I might be able to test it, I don't want to risk it. I'll revise the patch to record that "quit-application" has fired and check for it in the timer callback.
Regarding rnewman's point about HealthReporter installing a quit listener: it does! However, it's only installed just before storage is initialized, not in the ctor (https://hg.mozilla.org/services/services-central/file/6d3cb2d6873e/services/healthreport/healthreporter.jsm#l251). Same difference as far as I'm concerned.
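Concretely, the revised approach will amount to roughly the following (a minimal sketch; the object and member names are illustrative, not the final patch):

Components.utils.import("resource://gre/modules/Services.jsm");

const Cc = Components.classes;
const Ci = Components.interfaces;

// Illustrative delayed-init wrapper; names are not from the patch.
let DelayedHealthReporterInit = {
  _quitting: false,
  _timer: null,

  start: function (delayMS) {
    Services.obs.addObserver(this, "quit-application", false);
    this._timer = Cc["@mozilla.org/timer;1"].createInstance(Ci.nsITimer);
    this._timer.initWithCallback(this._onTimer.bind(this), delayMS,
                                 Ci.nsITimer.TYPE_ONE_SHOT);
  },

  observe: function (subject, topic, data) {
    if (topic != "quit-application") {
      return;
    }
    this._quitting = true;
    if (this._timer) {
      this._timer.cancel();
      this._timer = null;
    }
  },

  _onTimer: function () {
    // Even if the callback was enqueued before cancel() ran, this check
    // keeps us from initializing FHR after shutdown has commenced.
    if (this._quitting) {
      return;
    }
    // ... kick off HealthReporter initialization here ...
  },
};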
Assignee
Comment 11•12 years ago
Attachment #702078 - Attachment is obsolete: true
Attachment #702102 - Flags: review?(rnewman)
Comment 12•12 years ago
Comment on attachment 702102 [details] [diff] [review]
Cancel initialization timer on shutdown, v2
Review of attachment 702102 [details] [diff] [review]:
-----------------------------------------------------------------
LGTM.
Attachment #702102 - Flags: review?(rnewman) → review+
Comment 13•12 years ago
Landed on Greg's behalf.
https://hg.mozilla.org/services/services-central/rev/ad87f1e26c5a
Ehsan, please verify after merge when you're happy that this is working! Marking you with NEEDINFO accordingly.
Thanks!
Status: NEW → ASSIGNED
Flags: needinfo?(ehsan)
Whiteboard: [fixed in services]
Assignee
Comment 14•12 years ago
Enabled FHR on Linux:
https://hg.mozilla.org/services/services-central/rev/450f0995c9ad
If it doesn't fix the Talos issue, we'll back out before merging.
Comment 15•12 years ago
No need for the needinfo flag. I'll complain if I see bug 830145 or bug 830448 again when this gets merged.
(And note that I don't actually know if this patch fixes this bug. I think Gregory posted the patch here because he strongly believes this is a dupe of bug 830145, but I'd have expected to see this patch there...)
Flags: needinfo?(ehsan)
Assignee
Comment 16•12 years ago
I posted the patch here because this is the oldest bug, it was called out in my dev-planning post (and thus has a number of interested followers), and I strongly feel your 2 shutdown-related bugs are dupes.
Assignee
Comment 17•12 years ago
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
status-firefox20: --- → affected
status-firefox21: --- → fixed
Resolution: --- → FIXED
Whiteboard: [fixed in services]
Target Milestone: --- → mozilla21
Assignee
Comment 18•12 years ago
Comment on attachment 702102 [details] [diff] [review]
Cancel initialization timer on shutdown, v2
[Approval Request Comment]
Blanket FHR approval request. This patch seems to fix our Linux crasher!
Attachment #702102 - Flags: approval-mozilla-aurora?
Updated•12 years ago
Attachment #702102 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 21•12 years ago
Pushed to Aurora:
http://hg.mozilla.org/releases/mozilla-aurora/rev/71009680316b
Updated•12 years ago
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla21 → Firefox 21
Updated•6 years ago
Product: Firefox Health Report → Firefox Health Report Graveyard