Closed Bug 917764 Opened 11 years ago Closed 11 years ago

[Async shutdown] If clients misbehave too much, we should crash

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla27

People

(Reporter: Yoric, Assigned: Yoric)

References

(Blocks 2 open bugs)

Details

Attachments

(1 file, 7 obsolete files)

Crash upon longer timeout (WIP) 11 years ago David Teller [:Yoric] - still alive but not very active (deleted), patch	Irving : feedback+	Details \| Diff \| Splinter Review
Crash upon longer timeout (WIP), v2 11 years ago David Teller [:Yoric] - still alive but not very active (deleted), patch	froydnj : feedback+ benjamin : feedback+	Details \| Diff \| Splinter Review
Crash upon longer timeout, v3 11 years ago David Teller [:Yoric] - still alive but not very active (deleted), patch	froydnj : review+	Details \| Diff \| Splinter Review
async.shut.timeout 11 years ago David Teller [:Yoric] - still alive but not very active (deleted), patch	Yoric : review+	Details \| Diff \| Splinter Review
Crash upon longer timeout, v4 11 years ago David Teller [:Yoric] - still alive but not very active (deleted), patch	froydnj : review+	Details \| Diff \| Splinter Review
Crash tests 11 years ago David Teller [:Yoric] - still alive but not very active (deleted), patch		Details \| Diff \| Splinter Review
Crash tests, v2 11 years ago David Teller [:Yoric] - still alive but not very active (deleted), patch	ted : review+	Details \| Diff \| Splinter Review
Crash upon longer timeout, v5 11 years ago David Teller [:Yoric] - still alive but not very active (deleted), patch	Yoric : review+	Details \| Diff \| Splinter Review

David Teller [:Yoric] - still alive but not very active

Assignee

Description

•

11 years ago

At some point, we will end up with async shutdown clients that misbehave, i.e. never terminate. For the moment, we display a warning. When clients misbehave for too long, we need to crash, or at least exit(0) uncleanly, rather than draining their battery with a process that just won't quit.

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 1

•

11 years ago

Attached patch Crash upon longer timeout (WIP) (obsolete) (deleted) — Details — Splinter Review

Here is a first draft, generally untested. The idea is the following: - before starting the event loop, we start a "loose timer"; - the loose timer counts beats at the rhythm of one per second, until it has received 60 beats – the idea is that beats won't be received if the computer is sleeping; - we cancel the timer once all conditions are satisfied; - after 60 beats, if the timer hasn't been canceled, display the list of conditions that haven't been satisfied yet, we crash and label the crash with that list of conditions.

Assignee: nobody → dteller

Attachment #806632 - Flags: feedback?(nfroyd)

Attachment #806632 - Flags: feedback?(irving)

Attachment #806632 - Flags: feedback?(benjamin)

David Teller [:Yoric] - still alive but not very active

Assignee

Updated

•

11 years ago

Summary: [Async shutdown] Stricter timeouts → [Async shutdown] If clients misbehave too much, we should crash

:Irving Reid (No longer working on Firefox)

Comment 2

•

11 years ago

Comment on attachment 806632 [details] [diff] [review] Crash upon longer timeout (WIP) Review of attachment 806632 [details] [diff] [review]: ----------------------------------------------------------------- The overall approach looks reasonable to me. I'd like to see telemetry on the number of ticks we count before completing, so we can get at least a little bit of feedback on how close to the limit some of our users are getting. This will only work for spins that happen before the last telemetry ping gets saved, so it's only going to help a little bit...

Attachment #806632 - Flags: feedback?(irving) → feedback+

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 3

•

11 years ago

Side-note: to be more robust, we should probably launch the watchdog in a thread. This will require some native code (to export |AnnotateCrashReport| to C), some js-ctypes code and this won't work after webWorkersShutdown. Alternatively, we can wait until we (possibly) port the core of this module to C++.

Benjamin Smedberg

Comment 4

•

11 years ago

I don't think we should bother with another thread any time soon. The hangs we're trying to protect against are promises that fail to resolve and spin the event loop forever, which a timer will catch just fine. Code which spins in an infinite loop is more easily debuggable and much less interesting in general.

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 5

•

11 years ago

Attached patch Crash upon longer timeout (WIP), v2 (obsolete) (deleted) — Details — Splinter Review

Fixed to work even if we do not have a crash reporter at hand.

Attachment #806632 - Attachment is obsolete: true

Attachment #806632 - Flags: feedback?(nfroyd)

Attachment #806632 - Flags: feedback?(benjamin)

Attachment #807179 - Flags: feedback?(nfroyd)

Attachment #807179 - Flags: feedback?(benjamin)

Benjamin Smedberg

Comment 6

•

11 years ago

Comment on attachment 807179 [details] [diff] [review] Crash upon longer timeout (WIP), v2 Please make the entire annotation machine-readable JSON, i.e.: gCrashReporter.annotateCrashReport("AsyncShutdownTimeout": JSON.stringify({phase: topic, conditions: frozen});

Attachment #807179 - Flags: feedback?(benjamin) → feedback+

David Teller [:Yoric] - still alive but not very active

Assignee

Updated

•

11 years ago

Blocks: 919535

Nathan Froyd [:froydnj]

Updated

•

11 years ago

Attachment #807179 - Flags: feedback?(nfroyd) → feedback+

David Teller [:Yoric] - still alive but not very active

Assignee

Updated

•

11 years ago

Depends on: 920439

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 7

•

11 years ago

Attached patch Crash upon longer timeout, v3 (obsolete) (deleted) — Details — Splinter Review

Applied Benjamin's suggestion. I took the opportunity to rename the file holding the tests from "test_phase.js" to "test_AsyncShutdown.js", which kind of makes more sense.

Attachment #807179 - Attachment is obsolete: true

Attachment #809780 - Flags: review?(nfroyd)

Robert Kaiser

Comment 8

•

11 years ago

Comment on attachment 809780 [details] [diff] [review] Crash upon longer timeout, v3 >+ gCrashReporter.annotateCrashReport("AsyncShutdown Timeout", >+ JSON.stringify(data)); I think you should not have spaces in annotation keys, from what I see, we do not have that anywhere else.

Nathan Froyd [:froydnj]

Comment 9

•

11 years ago

Comment on attachment 809780 [details] [diff] [review] Crash upon longer timeout, v3 Review of attachment 809780 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/modules/AsyncShutdown.jsm @@ +59,5 @@ > + delete this.gCrashReporter; > + try { > + let reporter = Cc["@mozilla.org/xre/app-info;1"]. > + getService(Ci.nsICrashReporter); > + return this.gCrashReporter = null; Um, surely you mean |this.gCrashReporter = reporter|? @@ +89,5 @@ > dump("WARNING: " + error.stack + "\n"); > } > } > } > +function err(msg, error = null) { Since |err| and |warn| are so close, maybe you just want to refactor this?

Attachment #809780 - Flags: review?(nfroyd) → review+

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 10

•

11 years ago

Attached patch async.shut.timeout (obsolete) (deleted) — Details — Splinter Review

Applied feedback (including a change to the crash reporter key).

Attachment #809780 - Attachment is obsolete: true

Attachment #810163 - Flags: review+

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 11

•

11 years ago

Attached patch Crash upon longer timeout, v4 (obsolete) (deleted) — Details — Splinter Review

The duration before we crash can now be customized with a preference.

Attachment #810163 - Attachment is obsolete: true

Attachment #810554 - Flags: review?(nfroyd)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 12

•

11 years ago

Attached patch Crash tests (obsolete) (deleted) — Details — Splinter Review

Adding a crash test.

Attachment #810557 - Flags: review?(ted)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 13

•

11 years ago

Attached patch Crash tests, v2 (obsolete) (deleted) — Details — Splinter Review

Missing qref.

Attachment #810557 - Attachment is obsolete: true

Attachment #810557 - Flags: review?(ted)

Attachment #810559 - Flags: review?(ted)

:Irving Reid (No longer working on Firefox)

Comment 14

•

11 years ago

Was thinking about this issue last night, related to Mobile behaviour - we want to save state when the application goes into the background, in case the OS kills us in our sleep (see for example https://hg.mozilla.org/mozilla-central/file/e85b0372cece/toolkit/components/telemetry/TelemetryPing.js#l963). To support this, I think we want event-specific control of how long the timeout is, and whether we crash on failure. I suppose we could use a parallel implementation for the go-to-sleep use case rather than making this one more complicated.

(not currently active) Ted Mielczarek

Comment 15

•

11 years ago

Comment on attachment 810559 [details] [diff] [review] Crash tests, v2 Review of attachment 810559 [details] [diff] [review]: ----------------------------------------------------------------- ::: modules/libpref/src/init/all.js @@ +470,5 @@ > pref("toolkit.identity.enabled", false); > pref("toolkit.identity.debug", false); > > +// AsyncShutdown delay before crashing in case of shutdown freeze > +pref("toolkit.asyncshutdown.timeout.crash", 60000); Does this want to be in this patch, or in one of the previous patches? ::: toolkit/crashreporter/test/unit/test_crash_AsyncShutdown.js @@ +6,5 @@ > + Components.utils.import("resource://gre/modules/Services.jsm", this); > + Components.utils.import("resource://gre/modules/Promise.jsm", this); > + > + Services.prefs.setBoolPref("toolkit.asyncshutdown.testing", true); > + Services.prefs.setIntPref("toolkit.asyncshutdown.timeout.crash", 10); Is this seconds? If so, can we make this smaller? If not, then it's fine.

Attachment #810559 - Flags: review?(ted) → review+

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 16

•

11 years ago

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15) > Comment on attachment 810559 [details] [diff] [review] > Crash tests, v2 > > Review of attachment 810559 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: modules/libpref/src/init/all.js > @@ +470,5 @@ > > pref("toolkit.identity.enabled", false); > > pref("toolkit.identity.debug", false); > > > > +// AsyncShutdown delay before crashing in case of shutdown freeze > > +pref("toolkit.asyncshutdown.timeout.crash", 60000); > > Does this want to be in this patch, or in one of the previous patches? Ah, my bad, this should have gone in the other patch. > ::: toolkit/crashreporter/test/unit/test_crash_AsyncShutdown.js > @@ +6,5 @@ > > + Components.utils.import("resource://gre/modules/Services.jsm", this); > > + Components.utils.import("resource://gre/modules/Promise.jsm", this); > > + > > + Services.prefs.setBoolPref("toolkit.asyncshutdown.testing", true); > > + Services.prefs.setIntPref("toolkit.asyncshutdown.timeout.crash", 10); > > Is this seconds? If so, can we make this smaller? If not, then it's fine. That's in milliseconds.

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 17

•

11 years ago

(In reply to :Irving Reid from comment #14) > Was thinking about this issue last night, related to Mobile behaviour - we > want to save state when the application goes into the background, in case > the OS kills us in our sleep (see for example > https://hg.mozilla.org/mozilla-central/file/e85b0372cece/toolkit/components/ > telemetry/TelemetryPing.js#l963). > > To support this, I think we want event-specific control of how long the > timeout is, and whether we crash on failure. I suppose we could use a > parallel implementation for the go-to-sleep use case rather than making this > one more complicated. Filed as bug 921426. Can you elaborate in that bug? I'm not sure I understand the scenario.

David Teller [:Yoric] - still alive but not very active

Assignee

Updated

•

11 years ago

Blocks: 921426

Nathan Froyd [:froydnj]

Comment 18

•

11 years ago

Comment on attachment 810554 [details] [diff] [review] Crash upon longer timeout, v4 Review of attachment 810554 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/modules/AsyncShutdown.jsm @@ +72,5 @@ > > + > +// Crash the process if shutdown is really too long > +// (allowing for sleep). > +const PREF_DELAY_CRASH_MS = "toolkit.asyncshutdown.timeout.crash"; Nit: I like crash_timeout better than timeout.crash. Is this only for testing, or do you foresee this being useful in the field for users? Probably want to use something like "toolkit.asyncshutdown.testing.crash_timeout" if the former. @@ +101,5 @@ > } > } > } > +function warn(msg, error = null) { > + return log(msg, "WARNING: ", error); \o/

Attachment #810554 - Flags: review?(nfroyd) → review+

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 19

•

11 years ago

(In reply to Nathan Froyd (:froydnj) from comment #18) > Comment on attachment 810554 [details] [diff] [review] > Crash upon longer timeout, v4 > > Review of attachment 810554 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: toolkit/modules/AsyncShutdown.jsm > @@ +72,5 @@ > > > > + > > +// Crash the process if shutdown is really too long > > +// (allowing for sleep). > > +const PREF_DELAY_CRASH_MS = "toolkit.asyncshutdown.timeout.crash"; > > Nit: I like crash_timeout better than timeout.crash. > > Is this only for testing, or do you foresee this being useful in the field > for users? Probably want to use something like > "toolkit.asyncshutdown.testing.crash_timeout" if the former. I assume that some users will find something interesting to do with it, so I'm going to leave it without ".testing". However, let's replace "timeout.crash" with "crash_timeout". > > @@ +101,5 @@ > > } > > } > > } > > +function warn(msg, error = null) { > > + return log(msg, "WARNING: ", error); > > \o/ /me bows

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 20

•

11 years ago

Attached patch Crash upon longer timeout, v5 (deleted) — Details — Splinter Review

Merged, took the opportunity to fix a comment and bug 921460. https://tbpl.mozilla.org/?tree=Try&rev=72bae1c146bb

Attachment #810554 - Attachment is obsolete: true

Attachment #810559 - Attachment is obsolete: true

Attachment #811100 - Flags: review+

David Teller [:Yoric] - still alive but not very active

Assignee

Updated

•

11 years ago

Keywords: checkin-needed

Ryan VanderMeulen [:RyanVM]

Comment 22

•

11 years ago

https://hg.mozilla.org/integration/fx-team/rev/de4c92324898

Flags: in-testsuite+

Keywords: checkin-needed

Whiteboard: [fixed-in-fx-team]

Ryan VanderMeulen [:RyanVM]

Comment 23

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/de4c92324898

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Whiteboard: [fixed-in-fx-team]

Target Milestone: --- → mozilla27

Gregory Szorc [:gps]

Comment 24

•

11 years ago

I'm probably missing the bigger picture, but what's the user impact of this change? The initial comment indicates we currently show a warning. Did we replace that (possibly) user-actionable warning with a crash? If so, how will a user react? If I were a user and my Firefox started crashing for "no reason" on shutdown (silent updates, remember), I'd think Firefox was seriously broken and I may consider switching browsers. Do we still have a mechanism to alert users what's going on? I don't think the existing crash reporter counts, as it doesn't really display any actionable message. If not, that seems... bad from a user experience perspective.

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 25

•

11 years ago

Let me explain the behavior and strategy. 1. If a shutdown phase takes less than 10 seconds, nothing special happens. 2. If a shutdown phase takes 10 seconds, we believe that there is something weird going on, but we do not want to alarm the user. We print a stderr warning – we could try to display something GUI to the user, but I believe it's not even possible at this stage. Bug 915861 will add Telemetry on this. 3. If a shutdown phase takes more than 60 ticks (where 1 tick = 1 second, corrected to take into account high CPU loads/slow computers/sleep), we consider that Firefox has currently frozen: Firefox that doesn't work, cannot be restarted, will not be killed by the system (because the event loop is still running) and cannot be stopped by the average user. In addition, this Firefox continues using memory, CPU, battery and possibly disk. For the moment, we believe that our best recourse is to kill automatically this Firefox, e.g. crash it. Additionally, the crash information provides us with information on which phase has failed to complete. By the way, you mention silent updates – recall that if Firefox cannot shutdown, the update cannot be completed at all. Now, we can investigate alternative strategies: - It might be possible to display some GUI to let the user decide whether to crash Firefox. Is that better than crashing? - It might be possible to display some GUI to let the user decide whether to proceed with shutdown unsafely. Is that better than crashing? - It might be possible to mark some blockers as "non-critical", i.e. we can proceed without crashing if they take too long (bug 921426). gps, if you have a clear idea of a good strategy, could you please file a bug?

Gregory Szorc [:gps]

Comment 26

•

11 years ago

I think if we had a crash log with annotated reasons why a crash occurred (bug 875562) and had some "doctor" service in the browser that could use this to present useful tips, that would help. This stuff is all planned around FHR. Just need action. I'm just concerned around UX regressions of this particular patch. Your explanation does makes me feel better about things. Thank you.

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 27

•

11 years ago

(In reply to Gregory Szorc [:gps] from comment #26) > I think if we had a crash log with annotated reasons why a crash occurred > (bug 875562) and had some "doctor" service in the browser that could use > this to present useful tips, that would help. This stuff is all planned > around FHR. Just need action. Once that bug has landed, I'll be glad to add support for providing any information that can be deemed useful. > I'm just concerned around UX regressions of this particular patch. Your > explanation does makes me feel better about things. Thank you. My pleasure :)

Ed Morley [:emorley]

Updated

•

10 years ago

Depends on: 1018895

neil@parkwaycc.co.uk

Comment 28

•

10 years ago

Comment on attachment 811100 [details] [diff] [review] Crash upon longer timeout, v5 >+// AsyncShutdown delay before crashing in case of shutdown freeze >+pref("toolkit.asyncshutdown.timeout.crash", 60000); ... >+// Crash the process if shutdown is really too long >+// (allowing for sleep). >+const PREF_DELAY_CRASH_MS = "toolkit.asyncshutdown.crash_timeout"; Oops?

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 29

•

10 years ago

Ah, right.

You need to log in before you can comment on or make changes to this bug.