Closed Bug 1648142 Opened 4 years ago Closed 4 years ago

Ensure we flush cert_storage operations during fast shutdown

Categories

(Core :: Security: PSM, enhancement)

enhancement

Tracking

()

RESOLVED FIXED
mozilla79
Tracking Status
firefox79 --- fixed

People

(Reporter: dthayer, Assigned: dthayer)

References

Details

(Whiteboard: [fxperf:p1])

Attachments

(1 file)

No description provided.

This just spins the event loop during fast shutdown until all queued
cert_storage tasks have completed. The patch achieves this by simply
adding a counter which will be incremented and decremented on the
main thread via tying into the tasks' new and done methods. A
slightly more performant solution would use a condvar and sleep the
main thread waiting on pending operations to complete, but given the
low frequency of these occuring during shutdown, such an approach
would be overkill.

Pushed by dothayer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/90f0cea80878 Block on cert storage ops prior to shutdown r=keeler
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla79

Is there any reason we don't use AsyncShutdown for things like these?

Historically, spinning event loops during shutdown has been a major cause of Firefox shutdowns, crashes and I suspect data loss. AsyncShutdown was created specifically to get rid of these.

Flags: needinfo?(dothayer)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #4)

Is there any reason we don't use AsyncShutdown for things like these?

Historically, spinning event loops during shutdown has been a major cause of Firefox shutdowns, crashes and I suspect data loss. AsyncShutdown was created specifically to get rid of these.

Shutdown is littered with spinning event loops. Every time we shutdown a thread on the main thread during shutdown (which we do a lot if we're not in fast shutdown), we spin the main thread event loop waiting for it to let us know it's cleared out its event queue. That being said, are there more details you have regarding these problems? Bug numbers? I definitely don't want to fall into any pits we've fallen into before, and I hope I'm not missing something in my understanding here.

Also, I suppose I'm a little confused - can you explain to me how this is different than SpinEventLoopUntil?

Part of why I am explicitly calling this and spinning the event loop waiting for it (I would prefer to use a condvar and sleep the main thread waiting for it, but it happens infrequently enough that I think it's unlikely to be worth it), rather than using a friendlier API like AsyncShutdown, is that I don't want to be in a situation where adding shutdown blockers feels like something we can freely do. I want it to feel ugly and bad to make something block shutdown. Hopefully to help illustrate why, here is very incomplete sample of things which have been added as AsyncShutdown blockers seemingly without regard for shutdown performance:

A lot of the AsyncShutdown handlers also, I think just to be safe, clear out their internal dictionaries, remove event handlers, and things like that, thinking that these operations are just cheap. However, in aggregate, they aren't because it means we waste time 100% unnecessarily freeing memory. When we free memory, we have no guarantee that the memory is actually paged in, and we have to write to it, so that means we may do a read off disk just to throw it away.

A bit of evidence that code which looks cheap is not cheap: the latest fast shutdown advancement (skipping everything between xpcom-shutdown and the last cycle collection) seems to be showing around a 20% reduction in median shutdown times on Nightly. Based on telemetry, we aren't skipping any disk writes or anything - we're just skipping work which seems to be entirely unnecessary, and which I'm fairly confident looks cheap to engineers.

This all being said, please don't get me wrong - I do like the overall intent of the AsyncShutdown stuff! And if you disagree with me or think that there's a way we can salvage the benefits of the API while preserving a lockdown on shutdown performance creep, please do let me know.

Flags: needinfo?(dothayer)

(In reply to Doug Thayer [:dthayer] (he/him) from comment #5)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #4)

Is there any reason we don't use AsyncShutdown for things like these?

Historically, spinning event loops during shutdown has been a major cause of Firefox shutdowns, crashes and I suspect data loss. AsyncShutdown was created specifically to get rid of these.

Shutdown is littered with spinning event loops. Every time we shutdown a thread on the main thread during shutdown (which we do a lot if we're not in fast shutdown), we spin the main thread event loop waiting for it to let us know it's cleared out its event queue. That being said, are there more details you have regarding these problems? Bug numbers? I definitely don't want to fall into any pits we've fallen into before, and I hope I'm not missing something in my understanding here.

Well, I'm getting slightly nervous :)

I don't remember bug numbers, but shutdown spinloops are something we attempted to clean up ~5 years ago at the time we got serious about actually shutting down Firefox. I don't know whether there are still people monitoring this (I used to be the one doing that), but failing to shutdown, once I added facilities to detect them, were #1 (and #2 and a few others) crashers for some time and most of these errors were related to event loop spinning.

Typically, by spinning the event loop, code loses the consistency guarantees that we, as developers, are used to. We expect that, whenever even A is sent before event B, if we start handling event A, we'll stop handling it before we handle event B. With event loop spinning, by definition, we lose this consistency. So we have code that looks like A;B;C but is actually executed as A1;B1;C;B2;A2. This messed up with resource cleanup, shutdown order, etc.

Also, I suppose I'm a little confused - can you explain to me how this is different than SpinEventLoopUntil?

Well, AsyncShutdown was designed to replace event loop spinning with a higher-level API.

More precisely, the main differences are:

  1. AsyncShutdown will detect errors and submit detailed crash reports;
  2. regardless of the implementation of AsyncShutdown, you're working with Promises, which means that any asynchronicity in your code is marked explicitly, rather than hidden behind a function call.

Part of why I am explicitly calling this and spinning the event loop waiting for it (I would prefer to use a condvar and sleep the main thread waiting for it, but it happens infrequently enough that I think it's unlikely to be worth it), rather than using a friendlier API like AsyncShutdown, is that I don't want to be in a situation where adding shutdown blockers feels like something we can freely do. I want it to feel ugly and bad to make something block shutdown. Hopefully to help illustrate why, here is very incomplete sample of things which have been added as AsyncShutdown blockers seemingly without regard for shutdown performance:

A lot of the AsyncShutdown handlers also, I think just to be safe, clear out their internal dictionaries, remove event handlers, and things like that, thinking that these operations are just cheap. However, in aggregate, they aren't because it means we waste time 100% unnecessarily freeing memory. When we free memory, we have no guarantee that the memory is actually paged in, and we have to write to it, so that means we may do a read off disk just to throw it away.

Good point: if there are blockers that are not useful, we definitely need to get rid of them.

Let's do a PSA on this!

But please, for things that do need to block shutdown, let's make sure that they use AsyncShutdown. If I have time, one of these days, I'll see if there's a way to poison event loop spinning during shutdown, to make sure that we don't make shutdown fragile again.

A bit of evidence that code which looks cheap is not cheap: the latest fast shutdown advancement (skipping everything between xpcom-shutdown and the last cycle collection) seems to be showing around a 20% reduction in median shutdown times on Nightly. Based on telemetry, we aren't skipping any disk writes or anything - we're just skipping work which seems to be entirely unnecessary, and which I'm fairly confident looks cheap to engineers.

This all being said, please don't get me wrong - I do like the overall intent of the AsyncShutdown stuff! And if you disagree with me or think that there's a way we can salvage the benefits of the API while preserving a lockdown on shutdown performance creep, please do let me know.

What do you think of introducing shutdown peers? We could whitelist uses of AsyncShutdown and shutdown event loop spinning (not entirely sure how to do that, but I'm confident there's a way) and require reviews from shutdown peers before anybody introduces shutdown creep!

Flags: needinfo?(dothayer)

I really hate to be so argumentative here, but I want to understand precisely what we would be getting before adding complexity here. I do obviously want the best and most future-friendly solution for shutdown, though.

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #6)

Typically, by spinning the event loop, code loses the consistency guarantees that we, as developers, are used to. We expect that, whenever even A is sent before event B, if we start handling event A, we'll stop handling it before we handle event B. With event loop spinning, by definition, we lose this consistency. So we have code that looks like A;B;C but is actually executed as A1;B1;C;B2;A2. This messed up with resource cleanup, shutdown order, etc.

I can see why this is a conceptual problem for cases where we spin the event loop inside of an event. However, in our case, we spin the event loop inside of a function which is called statically after exiting the application's main run loop. So event ordering should be unaffected by this change.

More precisely, the main differences are:

  1. AsyncShutdown will detect errors and submit detailed crash reports;

I mean, AsyncShutdown will time out and report an error based on what things we were waiting on. In our case, the nsTerminator would time us out and we'd get a crash report which contains a stack indicating where the main thread was, which would tell us what we were waiting on. It would be the same overall information, no?

  1. regardless of the implementation of AsyncShutdown, you're working with Promises, which means that any asynchronicity in your code is marked explicitly, rather than hidden behind a function call.

So, I can think of two ways to make this particular bit of code asynchronous in that manner. 1) dispatch a dummy event to the cert_storage event queue, which will need to go through rust, and need to hold a reference to our promise, and then when we get it back, we check the code which we added in this bug to see if we've added any other events to the queue in the mean time, at which point we do the whole thing again. 2) wrap all uses of nsICertStorage in some callback tracking aggregate promise, and then trust that we don't dispatch to the cert_storage event queue through any other means now or in the future. We also in either case would need to call out to a new endpoint into AsyncShutdown from AppShutdown in order to spin its event loop to wait on this promise to resolve.

This is a nontrivial amount of code and a nontrivial amount of complexity for a benefit which I still do not understand. It is not as if everything we do during shutdown goes through AsyncShutdown today - quite a large amount does not, never will, and in some cases cannot - so I don't think we can make the argument that we are simplifying things by ensuring all shutdown goes through one API.

Good point: if there are blockers that are not useful, we definitely need to get rid of them.

Let's do a PSA on this!

The tricky thing for me here is that they're not useless. They are useful for leak checking, and that is also where a lot of them seem to come from - failing leak checks in CI. I've been wanting to do a PSA for some time now but the best approach to resolving this still hasn't crystallized for me yet. I think for each phase that we move fully out of the _exit(0) ("fast shutdown") path, we rename everything to do with that to make it clear that it exists only for leak checking and will not run under normal, non-CI circumstances. I was waiting to propose a final form for all of this until seeing clear numbers demonstrating the benefits of fast shutdown, which I think we now have.

So I would love to work with you on a proposal and a corresponding PSA. What I want at the end of this though is a fast shutdown "garden", where everything in that garden has been moved out from where it used to be, and made certain that it's only doing what it needs to do and that it belongs in shutdown. Everything that doesn't make it into the garden is left in the field for leak checking. The basic approach to building this garden so far is to work backwards through shutdown, phase by phase, and pull out only the pieces that make sense, leaving everything else behind. The PSA would basically just be saying "we're moving only the important bits of shutdown to <garden> - if you have something which you're certain belongs in <garden>, please reach out to us."

What do you think of introducing shutdown peers? We could whitelist uses of AsyncShutdown and shutdown event loop spinning (not entirely sure how to do that, but I'm confident there's a way) and require reviews from shutdown peers before anybody introduces shutdown creep!

I think a shutdown module would be a good idea, yes.

Flags: needinfo?(dothayer) → needinfo?(dteller)

(In reply to Doug Thayer [:dthayer] (he/him) from comment #7)

I really hate to be so argumentative here, but I want to understand precisely what we would be getting before adding complexity here. I do obviously want the best and most future-friendly solution for shutdown, though.

That makes entire sense. Do you wish to take the conversation to chat.mozilla.org?

Flags: needinfo?(dteller)
Regressions: 1665295
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: