Open Bug 1847830 Opened 11 months ago Updated 11 months ago

[meta] Experiment with OOM Handling Behaviour

Categories

(Core :: JavaScript Engine, task, P3)

task

Tracking

()

People

(Reporter: mgaudet, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: meta)

We've been talking for years about getting rid of OOM handling.

We can set ourselves up for success by setting up the conditions such that we could run an experiment.

Depends on: 1847831

(In reply to Matthew Gaudet (he/him) [:mgaudet] from comment #0)

We've been talking for years about getting rid of OOM handling.

And we never committed to do anything because there is not one OOM handling case, but multiple OOM handling cases based on what the code is currently doing, and that this has the potential to reduce the effectiveness of fuzzing on the JS engine.

In the past we had to handle OOM because failures to handle OOM would crash Firefox, and Firefox was still running in a single process.
Nowadays, only the parent process risk taking the entire browser if it were to crash on OOM. Which initiated this discussion from people who are not used to deal with it. However, our ability to handle OOM is part of some embedders reasoning for using SpiderMonkey and not other engines, in processes which are not following the same multi-process model used by Firefox.

For this reason, the Parser and IonMonkey are using the LifoAlloc as an allocation backend. The intent being that if there is any failure, such as OOM, instead of recovering from this inconsistent state, we trash everything which is dealing with the inconsistent state, and start again from scratch.

The problem with handling OOM, is the concern about recovering from inconsistent state, and the fact that we have to do failure propagation to some recovery handler.

In the past what was suggested was to replace OOM by crashes, thus removing the recovery and removing the error propagation.

-- Personal notes:

While I understand the concern about error recovery being a source of errors, I disapprove the fact that we should be crashing instead of handling it. What I suggest is that we should properly define what are the scope of no-op / discard actions in case of OOM failures.

We could add OOMNoopScope and OOMDiscardScope, which when compiled in debug mode, and ran under OOMTest function can help us track every mutation and ensure that none of these mutations are remaining past any OOM failures. This way the code would both be self-documenting, and we could leverage fuzzing in our mission to provide a well defined OOM handling semantic.

Crashing on OOM is like giving up, and relying on the system to discard the process on OOM.
However, Firefox parent process might have more options if we were to properly handle OOM, such as saving some state to disk, or killing child processes to help the system recover some of its memory.

Doing experiments doesn't commit us to any particular course of action.

We have a wide range of possible options in terms of how much we invest in handling OOM. At the low end of the scale, we could simply crash on any OOM. This is basically what Gecko does. We could carve out a subset of known large allocations (like wasm memories) where we handle OOM, and crash on the rest. We could stick with the status quo, where we propagate most OOMs, crash on the tricky ones, and rely on oomTest to find bugs in our error-handling code. At the high end, we could go further and implement things like OOMDiscardScope.

The amount of time and effort we invest should be proportional to the benefit we get. If OOM handling prevents a lot of crashes, then its value goes up. If it prevents fewer crashes, then its value goes down. If it prevents very few crashes, but adds poorly tested code paths that are exploitable, its value to users could even be negative.

We don't currently have great empirical evidence for a big benefit from OOM handling. I wrote this comment five years ago; according to those numbers, only 0.02% of Firefox crashes involve any sort of JSOutOfMemory annotation. On the other hand, we do seem to recover from a non-trivial number of them, at least long enough to run a GC. In theory, OS-level OOM killers and the lack of OOM handling in Gecko also undermine some of our OOM-handling. In practice, who knows how important those effects are.

Instead of spending our time theory-crafting, we should just collect more evidence. If that evidence shows that OOM handling is preventing a lot of crashes, we could beef up our current implementation. If it shows otherwise, we can pull back on OOM handling.

only 0.02% of Firefox crashes involve any sort of JSOutOfMemory annotation

We know from crash stats that ~5% of Firefox crashes are "OOM | small". Does this mean that gecko is two orders of magnitude more likely to hit OOM than Spidermonkey? That seems unlikely to me.

I'm in favour of handling OOMs but I agree with gathering more data about this. We should probably write up all our arguments for/against in one place, but this bug is not that place.

I do want to use this bug to drive experimentation and data gathering.

Here's some data I'd like to see:

  1. What is the likelihood that if we recover an OOM in the JS engine that we're able to subsequently recover in Gecko and continue executing to session shutdown.

  2. If we were to explicitly annotate the OOM handling we think would be critical to maintain (this is partially the thinking behind Bug 1847831), can we identify statically cases where we could change fallible operations to infallible.

    Part of my reasoning behind being in favour of removing OOM handling is that we have operations which are fallible due to OOM but are infallible by specification; this makes implementations of specifications appreciably more complicated and error prone since operations that should be infallible in specification become fallible in implementation. Great care needs to be taken to ensure that an inopportune OOM doesn't lead to an invalid state. While SpiderMonkey has tools like the oomTest function to help with this, Gecko does not.

Some of these I have ideas of how we can get that information:

  1. For (1) above, we have the JSOutOfMemory Crash annotation populated in OnGC. If we were to have this added to both the shutdown and crash pings, we could figure out an approximation of what fraction of Firefox sessions recover from out of memory (as Iain points out, Linux OOM killing does present an issue, but IIRC Windows has pre-commit semantics and will OOM instead).

  2. For (2) above, it feels like we might be able to get this information via sixgill, since we already have a whole program callgraph (or approximation therein) from the hazard analysis. I'm sure it wouldn't be straightforward, but it also seems possible.

My personal take: I want to be able to remove OOM handling, because I think it's a tax we pay on development without a lot of benefit -- but I don't have good evidence here; this is an immortal topic however and we should -have- evidence.

I honestly think that my hope that we can make an infallible API surface that makes sense for Gecko is the least likely outcome and the most likely reason we'd end up keeping OOM handling: While it's plausible that we can make some operations infallible that are currently fallible, there are just too many 'regular JS' operations that can run script and throw exceptions for me to think we're going to be able to do this.

Regarding embedders, I've not seen any with real OOM handling; see this document near the bottom where I looked into this a year or more ago.

(In reply to Jon Coppeard (:jonco) from comment #3)

only 0.02% of Firefox crashes involve any sort of JSOutOfMemory annotation

We know from crash stats that ~5% of Firefox crashes are "OOM | small". Does this mean that gecko is two orders of magnitude more likely to hit OOM than Spidermonkey? That seems unlikely to me.

Hmm, that's a good question. The sql query I was using is here. I haven't been able to figure out how to refresh it. (It uses old telemetry stuff, so it might not even be possible to refresh it.) You're right that the numbers aren't really compatible with OOM|small being 5% of overall crashes.

Another argument for getting better data!

Poking around at crash stats: in the last six months, there have been ~9.8M crashes. Of those, 1.24M have a JSOutOfMemory annotation. Roughly 3/4 of those have crash signatures that start with OOM, which I think means that we handle OOM in SM before crashing on OOM in Gecko. (Edit: Matt points out that AutoEnterUnsafeOOMRegion also shows up as OOM|small, so it's probably better to say that we have a handled OOM followed by an unhandled OOM.)

(It's possible that some of the remaining crashes are also OOM-related; looking at the distribution of all crash signatures that don't start with "OOM", there's a difference between crashes that have JSOutOfMemory annotations and crashes that don't. Setting aside "Empty Minidump", the top 5 most common JSOutOfMemory signatures all have OOM-related bugs associated with them.)

I can't find a way to search for specific values of the annotation to determine the ratio of Reported (we crash before the next GC) vs Recovered (we survive at least until the completion of the next GC). The old SQL query had a 2:1 reported:recovered ratio. Doing a quick spot check by hand, the ratio seems closer to 4:1; Out of a few different samples of 10 crashes with JSOutOfMemory, I usually saw 2ish recoveries.

Severity: -- → S3
Priority: -- → P3
Severity: S3 → N/A
You need to log in before you can comment on or make changes to this bug.