Closed Bug 917764 Opened 11 years ago Closed 11 years ago

[Async shutdown] If clients misbehave too much, we should crash

Categories

(Toolkit :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla27

People

(Reporter: Yoric, Assigned: Yoric)

References

(Blocks 2 open bugs)

Details

Attachments

(1 file, 7 obsolete files)

At some point, we will end up with async shutdown clients that misbehave, i.e. never terminate. For the moment, we display a warning. When clients misbehave for too long, we need to crash, or at least exit(0) uncleanly, rather than draining their battery with a process that just won't quit.
Attached patch Crash upon longer timeout (WIP) (obsolete) (deleted) — Splinter Review
Here is a first draft, generally untested. The idea is the following: - before starting the event loop, we start a "loose timer"; - the loose timer counts beats at the rhythm of one per second, until it has received 60 beats – the idea is that beats won't be received if the computer is sleeping; - we cancel the timer once all conditions are satisfied; - after 60 beats, if the timer hasn't been canceled, display the list of conditions that haven't been satisfied yet, we crash and label the crash with that list of conditions.
Assignee: nobody → dteller
Attachment #806632 - Flags: feedback?(nfroyd)
Attachment #806632 - Flags: feedback?(irving)
Attachment #806632 - Flags: feedback?(benjamin)
Summary: [Async shutdown] Stricter timeouts → [Async shutdown] If clients misbehave too much, we should crash
Comment on attachment 806632 [details] [diff] [review] Crash upon longer timeout (WIP) Review of attachment 806632 [details] [diff] [review]: ----------------------------------------------------------------- The overall approach looks reasonable to me. I'd like to see telemetry on the number of ticks we count before completing, so we can get at least a little bit of feedback on how close to the limit some of our users are getting. This will only work for spins that happen before the last telemetry ping gets saved, so it's only going to help a little bit...
Attachment #806632 - Flags: feedback?(irving) → feedback+
Side-note: to be more robust, we should probably launch the watchdog in a thread. This will require some native code (to export |AnnotateCrashReport| to C), some js-ctypes code and this won't work after webWorkersShutdown. Alternatively, we can wait until we (possibly) port the core of this module to C++.
I don't think we should bother with another thread any time soon. The hangs we're trying to protect against are promises that fail to resolve and spin the event loop forever, which a timer will catch just fine. Code which spins in an infinite loop is more easily debuggable and much less interesting in general.
Attached patch Crash upon longer timeout (WIP), v2 (obsolete) (deleted) — Splinter Review
Fixed to work even if we do not have a crash reporter at hand.
Attachment #806632 - Attachment is obsolete: true
Attachment #806632 - Flags: feedback?(nfroyd)
Attachment #806632 - Flags: feedback?(benjamin)
Attachment #807179 - Flags: feedback?(nfroyd)
Attachment #807179 - Flags: feedback?(benjamin)
Comment on attachment 807179 [details] [diff] [review] Crash upon longer timeout (WIP), v2 Please make the entire annotation machine-readable JSON, i.e.: gCrashReporter.annotateCrashReport("AsyncShutdownTimeout": JSON.stringify({phase: topic, conditions: frozen});
Attachment #807179 - Flags: feedback?(benjamin) → feedback+
Attachment #807179 - Flags: feedback?(nfroyd) → feedback+
Attached patch Crash upon longer timeout, v3 (obsolete) (deleted) — Splinter Review
Applied Benjamin's suggestion. I took the opportunity to rename the file holding the tests from "test_phase.js" to "test_AsyncShutdown.js", which kind of makes more sense.
Attachment #807179 - Attachment is obsolete: true
Attachment #809780 - Flags: review?(nfroyd)
Comment on attachment 809780 [details] [diff] [review] Crash upon longer timeout, v3 >+ gCrashReporter.annotateCrashReport("AsyncShutdown Timeout", >+ JSON.stringify(data)); I think you should not have spaces in annotation keys, from what I see, we do not have that anywhere else.
Comment on attachment 809780 [details] [diff] [review] Crash upon longer timeout, v3 Review of attachment 809780 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/modules/AsyncShutdown.jsm @@ +59,5 @@ > + delete this.gCrashReporter; > + try { > + let reporter = Cc["@mozilla.org/xre/app-info;1"]. > + getService(Ci.nsICrashReporter); > + return this.gCrashReporter = null; Um, surely you mean |this.gCrashReporter = reporter|? @@ +89,5 @@ > dump("WARNING: " + error.stack + "\n"); > } > } > } > +function err(msg, error = null) { Since |err| and |warn| are so close, maybe you just want to refactor this?
Attachment #809780 - Flags: review?(nfroyd) → review+
Attached patch async.shut.timeout (obsolete) (deleted) — Splinter Review
Applied feedback (including a change to the crash reporter key).
Attachment #809780 - Attachment is obsolete: true
Attachment #810163 - Flags: review+
Attached patch Crash upon longer timeout, v4 (obsolete) (deleted) — Splinter Review
The duration before we crash can now be customized with a preference.
Attachment #810163 - Attachment is obsolete: true
Attachment #810554 - Flags: review?(nfroyd)
Attached patch Crash tests (obsolete) (deleted) — Splinter Review
Adding a crash test.
Attachment #810557 - Flags: review?(ted)
Attached patch Crash tests, v2 (obsolete) (deleted) — Splinter Review
Missing qref.
Attachment #810557 - Attachment is obsolete: true
Attachment #810557 - Flags: review?(ted)
Attachment #810559 - Flags: review?(ted)
Was thinking about this issue last night, related to Mobile behaviour - we want to save state when the application goes into the background, in case the OS kills us in our sleep (see for example https://hg.mozilla.org/mozilla-central/file/e85b0372cece/toolkit/components/telemetry/TelemetryPing.js#l963). To support this, I think we want event-specific control of how long the timeout is, and whether we crash on failure. I suppose we could use a parallel implementation for the go-to-sleep use case rather than making this one more complicated.
Comment on attachment 810559 [details] [diff] [review] Crash tests, v2 Review of attachment 810559 [details] [diff] [review]: ----------------------------------------------------------------- ::: modules/libpref/src/init/all.js @@ +470,5 @@ > pref("toolkit.identity.enabled", false); > pref("toolkit.identity.debug", false); > > +// AsyncShutdown delay before crashing in case of shutdown freeze > +pref("toolkit.asyncshutdown.timeout.crash", 60000); Does this want to be in this patch, or in one of the previous patches? ::: toolkit/crashreporter/test/unit/test_crash_AsyncShutdown.js @@ +6,5 @@ > + Components.utils.import("resource://gre/modules/Services.jsm", this); > + Components.utils.import("resource://gre/modules/Promise.jsm", this); > + > + Services.prefs.setBoolPref("toolkit.asyncshutdown.testing", true); > + Services.prefs.setIntPref("toolkit.asyncshutdown.timeout.crash", 10); Is this seconds? If so, can we make this smaller? If not, then it's fine.
Attachment #810559 - Flags: review?(ted) → review+
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15) > Comment on attachment 810559 [details] [diff] [review] > Crash tests, v2 > > Review of attachment 810559 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: modules/libpref/src/init/all.js > @@ +470,5 @@ > > pref("toolkit.identity.enabled", false); > > pref("toolkit.identity.debug", false); > > > > +// AsyncShutdown delay before crashing in case of shutdown freeze > > +pref("toolkit.asyncshutdown.timeout.crash", 60000); > > Does this want to be in this patch, or in one of the previous patches? Ah, my bad, this should have gone in the other patch. > ::: toolkit/crashreporter/test/unit/test_crash_AsyncShutdown.js > @@ +6,5 @@ > > + Components.utils.import("resource://gre/modules/Services.jsm", this); > > + Components.utils.import("resource://gre/modules/Promise.jsm", this); > > + > > + Services.prefs.setBoolPref("toolkit.asyncshutdown.testing", true); > > + Services.prefs.setIntPref("toolkit.asyncshutdown.timeout.crash", 10); > > Is this seconds? If so, can we make this smaller? If not, then it's fine. That's in milliseconds.
(In reply to :Irving Reid from comment #14) > Was thinking about this issue last night, related to Mobile behaviour - we > want to save state when the application goes into the background, in case > the OS kills us in our sleep (see for example > https://hg.mozilla.org/mozilla-central/file/e85b0372cece/toolkit/components/ > telemetry/TelemetryPing.js#l963). > > To support this, I think we want event-specific control of how long the > timeout is, and whether we crash on failure. I suppose we could use a > parallel implementation for the go-to-sleep use case rather than making this > one more complicated. Filed as bug 921426. Can you elaborate in that bug? I'm not sure I understand the scenario.
Comment on attachment 810554 [details] [diff] [review] Crash upon longer timeout, v4 Review of attachment 810554 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/modules/AsyncShutdown.jsm @@ +72,5 @@ > > + > +// Crash the process if shutdown is really too long > +// (allowing for sleep). > +const PREF_DELAY_CRASH_MS = "toolkit.asyncshutdown.timeout.crash"; Nit: I like crash_timeout better than timeout.crash. Is this only for testing, or do you foresee this being useful in the field for users? Probably want to use something like "toolkit.asyncshutdown.testing.crash_timeout" if the former. @@ +101,5 @@ > } > } > } > +function warn(msg, error = null) { > + return log(msg, "WARNING: ", error); \o/
Attachment #810554 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd (:froydnj) from comment #18) > Comment on attachment 810554 [details] [diff] [review] > Crash upon longer timeout, v4 > > Review of attachment 810554 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: toolkit/modules/AsyncShutdown.jsm > @@ +72,5 @@ > > > > + > > +// Crash the process if shutdown is really too long > > +// (allowing for sleep). > > +const PREF_DELAY_CRASH_MS = "toolkit.asyncshutdown.timeout.crash"; > > Nit: I like crash_timeout better than timeout.crash. > > Is this only for testing, or do you foresee this being useful in the field > for users? Probably want to use something like > "toolkit.asyncshutdown.testing.crash_timeout" if the former. I assume that some users will find something interesting to do with it, so I'm going to leave it without ".testing". However, let's replace "timeout.crash" with "crash_timeout". > > @@ +101,5 @@ > > } > > } > > } > > +function warn(msg, error = null) { > > + return log(msg, "WARNING: ", error); > > \o/ /me bows
Attached patch Crash upon longer timeout, v5 (deleted) — Splinter Review
Merged, took the opportunity to fix a comment and bug 921460. https://tbpl.mozilla.org/?tree=Try&rev=72bae1c146bb
Attachment #810554 - Attachment is obsolete: true
Attachment #810559 - Attachment is obsolete: true
Attachment #811100 - Flags: review+
Flags: in-testsuite+
Keywords: checkin-needed
Whiteboard: [fixed-in-fx-team]
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [fixed-in-fx-team]
Target Milestone: --- → mozilla27
I'm probably missing the bigger picture, but what's the user impact of this change? The initial comment indicates we currently show a warning. Did we replace that (possibly) user-actionable warning with a crash? If so, how will a user react? If I were a user and my Firefox started crashing for "no reason" on shutdown (silent updates, remember), I'd think Firefox was seriously broken and I may consider switching browsers. Do we still have a mechanism to alert users what's going on? I don't think the existing crash reporter counts, as it doesn't really display any actionable message. If not, that seems... bad from a user experience perspective.
Let me explain the behavior and strategy. 1. If a shutdown phase takes less than 10 seconds, nothing special happens. 2. If a shutdown phase takes 10 seconds, we believe that there is something weird going on, but we do not want to alarm the user. We print a stderr warning – we could try to display something GUI to the user, but I believe it's not even possible at this stage. Bug 915861 will add Telemetry on this. 3. If a shutdown phase takes more than 60 ticks (where 1 tick = 1 second, corrected to take into account high CPU loads/slow computers/sleep), we consider that Firefox has currently frozen: Firefox that doesn't work, cannot be restarted, will not be killed by the system (because the event loop is still running) and cannot be stopped by the average user. In addition, this Firefox continues using memory, CPU, battery and possibly disk. For the moment, we believe that our best recourse is to kill automatically this Firefox, e.g. crash it. Additionally, the crash information provides us with information on which phase has failed to complete. By the way, you mention silent updates – recall that if Firefox cannot shutdown, the update cannot be completed at all. Now, we can investigate alternative strategies: - It might be possible to display some GUI to let the user decide whether to crash Firefox. Is that better than crashing? - It might be possible to display some GUI to let the user decide whether to proceed with shutdown unsafely. Is that better than crashing? - It might be possible to mark some blockers as "non-critical", i.e. we can proceed without crashing if they take too long (bug 921426). gps, if you have a clear idea of a good strategy, could you please file a bug?
I think if we had a crash log with annotated reasons why a crash occurred (bug 875562) and had some "doctor" service in the browser that could use this to present useful tips, that would help. This stuff is all planned around FHR. Just need action. I'm just concerned around UX regressions of this particular patch. Your explanation does makes me feel better about things. Thank you.
(In reply to Gregory Szorc [:gps] from comment #26) > I think if we had a crash log with annotated reasons why a crash occurred > (bug 875562) and had some "doctor" service in the browser that could use > this to present useful tips, that would help. This stuff is all planned > around FHR. Just need action. Once that bug has landed, I'll be glad to add support for providing any information that can be deemed useful. > I'm just concerned around UX regressions of this particular patch. Your > explanation does makes me feel better about things. Thank you. My pleasure :)
Depends on: 1018895
Comment on attachment 811100 [details] [diff] [review] Crash upon longer timeout, v5 >+// AsyncShutdown delay before crashing in case of shutdown freeze >+pref("toolkit.asyncshutdown.timeout.crash", 60000); ... >+// Crash the process if shutdown is really too long >+// (allowing for sleep). >+const PREF_DELAY_CRASH_MS = "toolkit.asyncshutdown.crash_timeout"; Oops?
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: