Bug 1005844 (Open) - Opened 10 years ago, Updated 2 years ago

jemalloc can miss usable regions in low-memory conditions

Categories

(Core :: Memory Allocator, defect)

Version: 32 Branch
Platform: x86_64, Windows 7

People

(Reporter: away, Unassigned)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(1 file, 17 obsolete files)

(deleted), patch; ehoogeveen: review?(glandium)

jemalloc's memory mapping works like this [1]:
1. Map a 1MB block
2. If the result is 1MB aligned, we're done
3. Otherwise, try to slide the region up to the next MB
4. If that fails, map a 2MB block and carve it down to size

While investigating OOM crashes, I have seen reports with hundreds of free, aligned 1MB regions but no free 2MB regions. In that situation, if the first 1MB block we map is misaligned and the slide fails, the entire allocation fails even though there were plenty of other usable candidates. Reports [2] and [3] both have 300+ usable regions.

[1] http://hg.mozilla.org/mozilla-central/file/b8a7881a83fa/memory/mozjemalloc/jemalloc.c#l2655
[2] https://crash-stats.mozilla.com/report/index/893b9010-d512-4a18-96df-056152140427
[3] https://crash-stats.mozilla.com/report/index/021fa736-84a8-4d11-8125-56e872140429
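For illustration, here is a rough sketch of that algorithm. This is not the mozjemalloc code: the names (chunk_alloc_sketch, OFFSET, map) are made up, and it assumes a POSIX mmap environment with a 1MiB chunk size.

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define CHUNKSIZE ((size_t)1 << 20)                   /* 1MiB */
#define OFFSET(p) ((uintptr_t)(p) & (CHUNKSIZE - 1))  /* distance past the last 1MiB boundary */

static void *map(size_t size)
{
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        return p == MAP_FAILED ? NULL : p;
}

static void *chunk_alloc_sketch(void)
{
        /* 1. Map a 1MiB block. */
        void *ret = map(CHUNKSIZE);
        if (ret == NULL)
                return NULL;
        size_t offset = OFFSET(ret);
        /* 2. If the result is 1MiB aligned, we're done. */
        if (offset == 0)
                return ret;
        /* 3. Otherwise, try to slide the region up to the next 1MiB boundary. */
        munmap(ret, CHUNKSIZE);
        void *slid = mmap((char *)ret + (CHUNKSIZE - offset), CHUNKSIZE,
                          PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
        if (slid != MAP_FAILED && OFFSET(slid) == 0)
                return slid;
        if (slid != MAP_FAILED)
                munmap(slid, CHUNKSIZE);
        /* 4. If that fails, map a 2MiB block and carve it down to an aligned 1MiB. */
        void *big = map(2 * CHUNKSIZE);
        if (big == NULL)
                return NULL;  /* <-- the failure mode this bug is about */
        size_t lead = OFFSET(big) ? CHUNKSIZE - OFFSET(big) : 0;
        if (lead)
                munmap(big, lead);                                 /* trim the misaligned head */
        munmap((char *)big + lead + CHUNKSIZE, CHUNKSIZE - lead);  /* trim the tail */
        return (char *)big + lead;
}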
Whiteboard: [MemShrink]
Hmm, so the problem here is that we can't make pages_map(NULL, ...) give us back a new region each time without leaving the previous region mapped (at least temporarily). In theory we could just

0) Use the current algorithm until the 2MB map attempt fails
1) Map a 1MB region
2) If unaligned, try to slide up to the next MB
3) If that fails, map the original 1MB region again

... and do 1-3 in a loop until it succeeds or the 1MB pages_map fails. Regardless of whether it succeeds or fails, we can then unmap the temporary mappings. This could have pretty bad worst-case behavior, though - perhaps we could limit the loop to n attempts?
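A minimal sketch of what such a loop could look like (entirely hypothetical, not the patch that was later attached; the name and cap of 8 are made up, and the 'slide up' step is omitted for brevity):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define CHUNKSIZE ((size_t)1 << 20)
#define FALLBACK_ALLOC_ATTEMPTS 8

static void *chunk_alloc_fallback_sketch(void)
{
        void *held[FALLBACK_ALLOC_ATTEMPTS];  /* temporarily retained misaligned mappings */
        int nheld = 0;
        void *result = NULL;

        while (nheld < FALLBACK_ALLOC_ATTEMPTS) {
                void *p = mmap(NULL, CHUNKSIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANON, -1, 0);
                if (p == MAP_FAILED)
                        break;                /* the 1MiB pages_map itself failed: give up */
                if (((uintptr_t)p & (CHUNKSIZE - 1)) == 0) {
                        result = p;           /* aligned: success */
                        break;
                }
                /* Keep the misaligned block mapped so the next iteration gets a
                 * *new* region, then try again. Everything held is released below. */
                held[nheld++] = p;
        }
        while (nheld > 0)
                munmap(held[--nheld], CHUNKSIZE);
        return result;
}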
I think we're primarily worried about Windows here.
Something like this? I implemented the idea from comment #1 with a cap on the number of attempts before we give up. I arbitrarily set FALLBACK_ALLOC_ATTEMPTS to 8 - on the one hand, failing means failing the allocation; on the other, we probably don't want to increase peak memory usage too much just to find a free chunk. I've checked that this builds on Windows with --enable-jemalloc and browsed around a bit, but consider this mostly untested (certainly I haven't tested the actual mechanism in a low-memory situation).

This won't prevent the fragmentation from happening in the first place - I imagine that deallocating 1MiB chunks as the memory is no longer needed will inevitably create lots of free (aligned) 1MiB chunks over time, and the only way to slow that down would be to put the loop in this patch higher up in the algorithm (but that would be really slow).

I'm asking for feedback instead of review mostly because IIRC we don't like diverging from upstream jemalloc, and I'd like to know if the approach is sound before I look into upstreaming this :)
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
Attachment #8417649 - Flags: feedback?(dmajor)
Bug 956501 backported the chunk allocation logic from jemalloc 3, but that didn't apply to Windows, which is still using the old logic. I'd first update the code to use the new jemalloc 3 logic on Windows too, and then see if there's something left to do. If there is, it should be discussed upstream.
Another idea might be to use an OS facility (e.g. VirtualQuery) to find a block ourselves. Potentially slow, but better than crashing.

But Mike is the expert here -- I guess the first thing to try would be the new jemalloc.

By the way, I opened bug 1005849 for this issue in the JS allocator.
I had a look at the latest code, at https://github.com/jemalloc/jemalloc/blob/dev/src/chunk_mmap.c#L143

It looks like they no longer try to slide the region up to the next MB, and instead try the bigger allocation right away. They use a few helper functions instead of doing everything in chunk_alloc_mmap and subtract the size of a page, but otherwise it's the same idea, so I think something like my patch would still be useful. An upstream patch would look somewhat different given the amount of refactoring there, though.
I'm going to have another look tomorrow and refactor our jemalloc allocation logic to look more like jemalloc3 tip, so the new pass is easier to put into context.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #7)
> I'm going to have another look tomorrow and refactor our jemalloc allocation
> logic to look more like jemalloc3 tip, so the new pass is easier to put into
> context.

As I said, we already have the jemalloc3 code backported, it's just in a #if branch that excludes windows.
So in practice, this is all you should need to use the jemalloc3 code on windows:

diff --git a/memory/mozjemalloc/jemalloc.c b/memory/mozjemalloc/jemalloc.c
--- a/memory/mozjemalloc/jemalloc.c
+++ b/memory/mozjemalloc/jemalloc.c
@@ -2622,17 +2622,17 @@ malloc_rtree_set(malloc_rtree_t *rtree, 
        subkey = (key << lshift) >> ((SIZEOF_PTR << 3) - bits);
        node[subkey] = val;
        malloc_spin_unlock(&rtree->lock);
 
        return (false);
 }
 #endif
 
-#if defined(MOZ_MEMORY_WINDOWS) || defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE)
+#if defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE)
 
 /* Allocate an aligned chunk while maintaining a 1:1 correspondence between
  * mmap and unmap calls.  This is important on Windows, but not elsewhere. */
 static void *
 chunk_alloc_mmap(size_t size, bool pagefile)
 {
        void *ret;
 #ifndef JEMALLOC_USES_MAP_ALIGN
Comment on attachment 8417649 [details] [diff] [review]
Add a fallback loop to find 1 MiB chunks when no 2 MiB chunks are available.

Looks like this patch will be replaced, please flag glandium for the new one.
Attachment #8417649 - Flags: feedback?(dmajor)
dmajor++

> jemalloc's memory mapping works like this [1]:
> 1. Map a 1MB block
> 2. If the result is 1MB aligned, we're done
> 3. Otherwise, try to slide the region up to the next MB
> 4. If that fails, map a 2MB block and carve it down to size

How does one "slide the region up"? I have no idea how that would work on Unix; is it a Windows concept?
It doesn't look like there was a reason given for not enabling the new code on Windows - it was simply a pre-existing #ifdef as far as I can tell. So let's try enabling the new code on Windows as Mike indicated and see if anything explodes. That won't fix this bug, but we won't have to worry about the two diverging codepaths on our main platforms (I don't know in what environments JEMALLOC_USES_MAP_ALIGN or MALLOC_PAGEFILE would be defined, though in the former case the allocation is trivial).

(In reply to Nicholas Nethercote [:njn] from comment #11)
> How does one "slide the region up"? I have no idea how that would work on
> Unix; is it a Windows concept?

They unmap the 1MiB region they got, then request a 1MiB region at |ret - offset + chunksize|, hoping that the chunk of contiguous memory extends far enough to find an aligned 1MiB chunk. jonco noted in bug 956501 that this will always fail on systems that allocate memory in decreasing address order, and suggested a patch to check both directions. However, jemalloc3 removes this fastpath altogether, falling back to the slow path right away instead.

In this bug I'd like to add a 'last ditch' slow path to look for 1MiB chunks if both the fast path and the slow path failed - it might be worth trying the 'slide' approach again for this since it's a slow path anyway.
Attached patch Part 1: Switch Windows over to the jemalloc3 allocation logic. (obsolete) (deleted)
This removes the #ifdef for Windows and imports the pages_trim code from jemalloc tip to make it work on Windows as well (pages_unmap doesn't support the size parameter on Windows - it's all or nothing, so we have to unmap and remap the smaller area).

I kicked off a Windows-only try run, looks good: https://tbpl.mozilla.org/?tree=Try&rev=2c056ff3de57

This won't fix the problem in this bug - if anything, it might make it worse if we're in the unfortunate situation where the first chunk of free memory is unaligned, contains an aligned 1MB chunk, but is smaller than |size + alignment - page| - since this will always bump us down to the slow method, and that will jump past it. But that situation might be rare, and smaller allocations could fill that hole anyway. Part 2 (to be written) should fix this when available memory becomes low enough or fragmentation becomes high enough that the slow method fails.
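For context, the pages_trim trick on Windows roughly amounts to the following. This is only an illustrative sketch (the signature loosely mirrors jemalloc's pages_trim, but the body is simplified):

#include <windows.h>

/* Keep [addr+leadsize, addr+leadsize+size) out of an over-sized reservation.
 * VirtualFree(MEM_RELEASE) can only release a whole VirtualAlloc'd region, so
 * we release everything and immediately re-reserve just the part we want. */
static void *pages_trim_sketch(void *addr, size_t alloc_size,
                               size_t leadsize, size_t size)
{
        void *ret = (char *)addr + leadsize;

        (void)alloc_size;  /* only needed on the munmap-based (non-Windows) path */
        if (!VirtualFree(addr, 0, MEM_RELEASE))
                return NULL;
        /* Another thread can race us here and grab the address range. */
        void *new_addr = VirtualAlloc(ret, size, MEM_COMMIT | MEM_RESERVE,
                                      PAGE_READWRITE);
        if (new_addr == ret)
                return ret;
        if (new_addr != NULL)
                VirtualFree(new_addr, 0, MEM_RELEASE);
        return NULL;       /* lost the race; the caller has to start over */
}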
Attachment #8417649 - Attachment is obsolete: true
Attachment #8418866 - Flags: review?(mh+mozilla)
This adds a 'last ditch' pass, called if chunk_alloc_mmap_slow fails. I separated the Windows and non-Windows logic since on Windows, map and unmap calls have to be matched (unmap can only unmap previously mapped regions).

On Windows, I believe VirtualAlloc allocates memory in increasing address order unless the MEM_TOP_DOWN flag is passed. jemalloc doesn't use it (and it generates O(n^2) behavior [1]), so the code tries to 'slide' the memory region upward to be aligned by trying to allocate at |ret + alignment - offset|.

For non-Windows platforms, I reintroduced the code that jonco suggested in bug 956501 to try both directions (since the OS allocator may be handing out addresses in decreasing order). jonco, I didn't quite copy your code, so could you have a look to see if I got it right? Note I will probably suggest a similar approach to this for bug 1005849.

I am of course open to suggestions on naming, and setting LAST_DITCH_ALLOC_ATTEMPTS to 8 is still arbitrary. I don't know how hard it would be to get data on how many unalignable 1MiB chunks we have to skip on average before finding a suitable one, once this pass is hit. This is a slow path, so perhaps we could gather telemetry? That is, if this approach is acceptable.

Oh, and the style in this file makes it very awkward to stay within 80 columns (it uses 8 spaces or an 8-space tab per level of indentation), so I decided to go slightly over in places. Other lines in the file already do this.

I checked that this builds, but I haven't sent it to try - but then, try shouldn't hit this pass at all unless we're already OOMing currently.

[1] http://randomascii.wordpress.com/2011/08/05/making-virtualalloc-arbitrarily-slower/
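To make the Windows half of this concrete, here is a rough, hypothetical sketch. The names and structure are illustrative (not the attached patch), error handling is simplified, and the cap matches the arbitrary LAST_DITCH_ALLOC_ATTEMPTS of 8:

#include <stdint.h>
#include <windows.h>

#define CHUNKSIZE ((size_t)1 << 20)
#define LAST_DITCH_ALLOC_ATTEMPTS 8

static void *chunk_alloc_last_ditch_sketch(void)
{
        void *held[LAST_DITCH_ALLOC_ATTEMPTS];  /* placeholder reservations */
        int nheld = 0;
        void *result = NULL;

        while (nheld < LAST_DITCH_ALLOC_ATTEMPTS) {
                void *p = VirtualAlloc(NULL, CHUNKSIZE, MEM_COMMIT | MEM_RESERVE,
                                       PAGE_READWRITE);
                if (p == NULL)
                        break;
                uintptr_t offset = (uintptr_t)p & (CHUNKSIZE - 1);
                if (offset == 0) {
                        result = p;
                        break;
                }
                /* Try to 'slide' upward: release this region and re-reserve at the
                 * next 1MiB boundary, |p + alignment - offset|. */
                void *target = (char *)p + (CHUNKSIZE - offset);
                VirtualFree(p, 0, MEM_RELEASE);
                void *q = VirtualAlloc(target, CHUNKSIZE, MEM_COMMIT | MEM_RESERVE,
                                       PAGE_READWRITE);
                if (q == target) {
                        result = q;
                        break;
                }
                if (q != NULL)
                        VirtualFree(q, 0, MEM_RELEASE);
                /* Couldn't align this region: park a placeholder reservation on it so
                 * the next attempt probes a different part of the address space. */
                void *placeholder = VirtualAlloc(p, CHUNKSIZE, MEM_RESERVE, PAGE_NOACCESS);
                if (placeholder == NULL)
                        break;  /* lost a race; give up rather than spin */
                held[nheld++] = placeholder;
        }
        while (nheld > 0)
                VirtualFree(held[--nheld], 0, MEM_RELEASE);
        return result;
}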
Attachment #8418966 - Flags: review?(mh+mozilla)
Attachment #8418966 - Flags: feedback?(jcoppeard)
Comment on attachment 8418866 [details] [diff] [review]
Part 1: Switch Windows over to the jemalloc3 allocation logic.

Review of attachment 8418866 [details] [diff] [review]:
-----------------------------------------------------------------

This brings us to the code in jemalloc 3, and this is good.
Attachment #8418866 - Flags: review?(mh+mozilla) → review+
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #14)
> LAST_DITCH_ALLOC_ATTEMPTS to 8 is still arbitrary. I don't know how hard it
> would be to get data on how many unalignable 1MiB chunks we have to skip on
> average before finding a suitable one, once this pass is hit. This is a slow
> path, so perhaps we could gather telemetry? That is, if this approach is
> acceptable.

If telemetry is not practical then another option is to simulate on existing crash data. I am generally seeing <= 7 misaligned blocks per crash report.
> I am generally seeing <= 7 misaligned blocks per crash report.
To clarify: those are from OOM crash reports.
The previous code had a comment that looked like an explanation for the MOZ_MEMORY_WINDOWS condition.

 /* Allocate an aligned chunk while maintaining a 1:1 correspondence between
  * mmap and unmap calls.  This is important on Windows, but not elsewhere. */

Should that be moved to match the new location of the Windows code?
Comment on attachment 8418966 [details] [diff] [review]
Part 2: Add a 'last ditch' pass to find alignable chunks smaller than |size + alignment - pagesize|.

Review of attachment 8418966 [details] [diff] [review]:
-----------------------------------------------------------------

This looks sound. But I'd rather jemalloc upstream look at it (and we should definitely get this upstream).

That said, I'm concerned that once you reach a state where you need to use this method, you're going to use it every single time, even though you already know the first n allocations are going to fail anyway.

I wonder if it wouldn't make sense to try to give the last (aligned) address we got as a hint to the first pages_map in chunk_alloc_mmap or at some other point in the process.

::: memory/mozjemalloc/jemalloc.c
@@ +2744,5 @@
>  
> +#define LAST_DITCH_ALLOC_ATTEMPTS 8
> +
> +/*
> + * In a low memory or high fragmentation situation, alignable chunks of the 

trailing space.

@@ +2818,5 @@
>                  return (NULL);
>          do {
>                  pages = pages_map(NULL, alloc_size, -1);
>                  if (pages == NULL)
> +                        return (chunk_alloc_mmap_last_ditch(size, alignment));

I think this should rather be in chunk_alloc_mmap, if chunk_alloc_mmap_slow fails.
Attachment #8418966 - Flags: review?(mh+mozilla)
Attachment #8418966 - Flags: review?(jasone)
Attachment #8418966 - Flags: feedback+
Thanks for the review! Carrying forward r=glandium.

(In reply to David Major [:dmajor] (UTC+12) from comment #18)
> Should that be moved to match the new location of the Windows code?

It definitely makes no sense to have that comment there anymore. I would put it in pages_trim, but that function now matches upstream jemalloc3 so I just removed it instead.
Attachment #8418866 - Attachment is obsolete: true
Attachment #8419109 - Flags: review+
(In reply to David Major [:dmajor] (UTC+12) from comment #16)
> If telemetry is not practical then another option is to simulate on existing
> crash data. I am generally seeing <= 7 misaligned blocks per crash report.

Thanks. So 8 should be enough while giving us at most 8 MiB of additional peak memory usage (not relevant to us, but perhaps relevant to other processes).

(In reply to Mike Hommey [:glandium] from comment #19)
> This looks sound. But I'd rather jemalloc upstream look at it (and we should
> definitely get this upstream).

Agreed. This should apply to jemalloc tip with only minimal changes - I know very little about working with git though.

> That said, I'm concerned that once you reach a state where you need to use
> this method, you're going to use it every single time, while, obviously, you
> already know the first n allocations are going to fail anyways.

Yeah, if none of the first n mapped blocks can be aligned, we're always going to fall back to the slow path and do the map, try-to-align, skip and unmap dance.

If they *can* be aligned and used, we'll start getting already aligned chunks back from the allocator and move back onto the fast path. But if there's even one unalignable chunk that we always get back, we'll be on the slow path forever (unless some other allocator happens to free enough of the surrounding memory to make the chunk alignable).

> I wonder if it wouldn't make sense to try to give the last (aligned) address
> we got as a hint to the first pages_map in chunk_alloc_mmap or at some other
> point in the process.

I was thinking about this, but even if we store this address I'm not sure how much we could actually *do* with it. We can ask the OS's allocator to map a chunk of memory at a certain address, but we can't tell it to avoid mapping to addresses below (or above) a certain threshold.

We could still skip past the fast path if such a hint is present, and skip past the normal slow path, or avoid trying to align the unalignable chunks - but we'd still have to temporarily map them each time. I wonder if it's worth the complexity.

> trailing space.

Fixed.

> > +                        return (chunk_alloc_mmap_last_ditch(size, alignment));
> 
> I think this should rather be in chunk_alloc_mmap, if chunk_alloc_mmap_slow
> fails.

Done. I put it there originally because I didn't want to make the fast path slower, but this works just as well.
Attachment #8418966 - Attachment is obsolete: true
Attachment #8418966 - Flags: review?(jasone)
Attachment #8418966 - Flags: feedback?(jcoppeard)
Attachment #8419112 - Flags: review?(jasone)
Comment on attachment 8419112 [details] [diff] [review]
Part 2 v2: Add a 'last ditch' pass to find alignable chunks smaller than |size + alignment - pagesize|.

Oh, and I'd still like jonco to have a quick look at this :)
Attachment #8419112 - Flags: feedback?(jcoppeard)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #21)
> > I wonder if it wouldn't make sense to try to give the last (aligned) address
> > we got as a hint to the first pages_map in chunk_alloc_mmap or at some other
> > point in the process.
> 
> I was thinking about this, but even if we store this address I'm not sure
> how much we could actually *do* with it. We can ask the OS's allocator to
> map a chunk of memory at a certain address, but we can't tell it to avoid
> mapping to addresses below (or above) a certain threshold.

I'm not sure I follow. Technically, if you mmap/VirtualAlloc for 1MB giving the address of the last MB you got, the OS is likely to give you an aligned 1MB block adjacent to that last MB you got. I'm not sure what implication that has on performance, though.

PS: since you're looking into jemalloc at the moment, I'll take on the occasion that it would be worthwhile completing the transition to jemalloc3.
(In reply to Mike Hommey [:glandium] from comment #23)
> I'm not sure I follow. Technically, if you mmap/VirtualAlloc for 1MB giving
> the address of the last MB you got, the OS is likely to give you an aligned
> 1MB block adjacent to that last MB you got. I'm not sure what implication
> that has on performance, though.

As far as I can tell from the documentation [1], VirtualAlloc will just fail if the region starting at lpAddress is already reserved. So it'll either give you back the address you handed in (rounded down to the start of a page), or fail and return NULL.

mmap *does* seem to use the parameter as a hint [2], so it might be worthwhile implementing some logic for that.

> PS: since you're looking into jemalloc at the moment, I'll take on the
> occasion that it would be worthwhile completing the transition to jemalloc3.

I don't know if I have time to take that on ... I'm not sure how much it would involve. FWIW, RyanVM linked a few try runs to me on IRC of upgrading to the latest stable and dev tip of jemalloc, and they were pretty colorful.

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887%28v=vs.85%29.aspx
[2] http://man7.org/linux/man-pages/man2/mmap.2.html
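A tiny illustration of that difference, assuming the documented behavior linked above (this is not code from any of the patches, and map_at is a made-up name):

#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
static void *map_at(void *addr, size_t size)
{
        /* Returns the requested region (rounded down to the allocation granularity)
         * on success, NULL if any part of that range is already reserved. */
        return VirtualAlloc(addr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}
#else
#include <sys/mman.h>
static void *map_at(void *addr, size_t size)
{
        /* addr is only a hint: the kernel may place the mapping elsewhere, so the
         * caller must compare the result against addr (as jemalloc's pages_map does). */
        void *p = mmap(addr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        return p == MAP_FAILED ? NULL : p;
}
#endif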
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #24)
> I don't know if I have time to take that on ... I'm not sure how much it
> would involve. FWIW, RyanVM linked a few try runs to me on IRC of upgrading
> to the latest stable and dev tip of jemalloc, and they were pretty colorful.

Funny story on that, the ones I sent you didn't actually build with jemalloc3 enabled. Whoops :P

Here's the real links:
3.6: https://tbpl.mozilla.org/?tree=Try&rev=8c578cbcf58e
Tip: https://tbpl.mozilla.org/?tree=Try&rev=01d5055d26b7

So basically Windows is busted and every other platform works fine. Note that jemalloc is only used on opt builds, so the debug builds on those runs were essentially no-ops.
See bug 762449 and its opened dependencies as to what's left for jemalloc3 (one more to file being to update to jemalloc 3.6, which may or may not fix bug 762448).
Comment on attachment 8419112 [details] [diff] [review]
Part 2 v2: Add a 'last ditch' pass to find alignable chunks smaller than |size + alignment - pagesize|.

Review of attachment 8419112 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/mozjemalloc/jemalloc.c
@@ +2790,5 @@
> +                }
> +                /* Try to make an aligned chunk by allocating memory before the start. */
> +                if (pages_map((void *)((uintptr_t)ret - offset), offset, -1)) {
> +                        /* Clean up unneeded trailing space. */
> +                        pages_unmap((void *)((uintptr_t)ret + alignment - offset), offset);

Shouldn't this be |ret + size - offset| for the start address we want to unmap?

Otherwise, this looks good to me.
Attachment #8419112 - Flags: feedback?(jcoppeard) → feedback+
(In reply to Jon Coppeard (:jonco) from comment #27)
> Shouldn't this be |ret + size - offset| for the start address we want to
> unmap?

Thanks for catching that! Looking at the logic again also made me realize something else - given that mmap only uses the address parameter as a hint, we should make sure we actually got back the address we requested.
This version fixes the above issues.

I sent this to try with a patch to make the new slow path the *only* path for testing purposes, and it seems to be working just fine: https://tbpl.mozilla.org/?tree=Try&rev=6e4d3e6db0e5

However, I'm not sure how to go about testing this in a low memory situation.
Attachment #8419112 - Attachment is obsolete: true
Attachment #8419112 - Flags: review?(jasone)
Attachment #8421646 - Flags: review?(jasone)
I noticed that pages_map already returns null if the returned address doesn't match the requested address, so no further checking is needed! That also means we can't use previously returned addresses as a hint without modifying pages_map, however (for future reference).

Sorry about the churn!
Attachment #8421646 - Attachment is obsolete: true
Attachment #8421646 - Flags: review?(jasone)
Attachment #8421721 - Flags: review?(jasone)
Whiteboard: [MemShrink] → [MemShrink:P1]
Comment on attachment 8421721 [details] [diff] [review]
Part 2 v4: Add a 'last ditch' pass to find alignable chunks smaller than |size + alignment - pagesize|.

My patches in sister bug 1005849 are about to land, and the algorithm has changed significantly, so cancelling review for now. The new code is substantially more complex, but it has several important advantages that I think make the complexity worthwhile.

Mike, since we haven't heard from Jason yet, would it be possible to move forward here once I port over the code from bug 1005849 and get this into Firefox? Bug 1005849 includes some tests so I'm fairly confident in the correctness already. I don't mind helping get this upstreamed, but it would be nice to get this improvement in soon.
Attachment #8421721 - Flags: review?(jasone)
Flags: needinfo?(mh+mozilla)
This is basically a carbon copy of part 3 in bug 1005849. I just ported it over to C-style jemalloc and made sure it compiles. Didn't take long, since there wasn't a lot of C++-only stuff in there to begin with. I made sure this compiles and ran with it for a bit on Windows, but I'm eager to see how this does on Try. In theory it should work just fine, though, considering it's just a copy of bug 1005849 and I tested that fairly extensively.

Mike, tentatively asking you to review this again pending your answer to my question above. Note that Terrence already reviewed the algorithm for bug 1005849 (though it wouldn't hurt to have another who understands it).
Attachment #8421721 - Attachment is obsolete: true
Attachment #8428196 - Flags: review?(mh+mozilla)
Oh, I should explain the basic algorithm. This comes from an idea that Hugh Nougher sent me via e-mail.

The idea is that if the allocator gives us an unaligned chunk, we hold onto the unaligned *portion* of that chunk and then ask for another. If the contiguous free region of memory starting at the aligned address inside our earlier chunk is big enough to hold another chunk, the allocator will give us that one and we're done. 

If it doesn't, we'll get the next chunk instead. There's a chance that this new chunk won't be aligned, but there's also a good chance that it *will* be aligned, especially if earlier deallocations left behind aligned, chunk-sized holes.

So this has two advantages: not only is it much more likely to give us an aligned chunk, even if the earlier region couldn't be aligned, but it also gives us a new chunk that might be alignable even if it isn't aligned. The first advantage makes this worth attempting once in an 'intermediate' path before falling back to the slow path. The second advantage simplifies and speeds up the last ditch pass.

However, this approach does have the drawback that it makes the heap growth direction much more important. We have to decide which unaligned area to keep and which to discard, and this differs depending on whether the heap grows up or down. If we guess *wrong*, this also introduces a threading issue - the area we deallocated may have been filled by something else in the meantime, so we might not be able to get our original chunk back to try the other direction.

This makes the Linux code significantly more complicated than before, and because we want the intermediate path to be at least somewhat fast (and suffer from threading issues as little as possible), I introduced a variable (growth_direction) to keep track of whether or not we guessed right, and choose the direction to try first based on that. The amount of logic for this is fairly minimal, but it does introduce some subtlety.

Another complication is that since we're now saving an unaligned region of variable length, we have to keep track of the length in order to be able to unmap it later. This isn't a problem on Windows, where the size parameter to unmap isn't used, but it's easier for all platforms to share the same code in this case. In the last ditch pass, this means we need another array to hold the temporary chunk sizes. In the intermediate pass, it means we need two extra variables.

In the end I think I managed to make the interface clean enough that the complexity is manageable, and I think the advantage of being able to deal with an entire new class of situations before falling back to the slow path* is enough of one that the complexity is an acceptable tradeoff - but others might disagree with me on that.

* specifically, the situation where the first chunk is unalignable but the second chunk is already aligned, which may be common and cause a lot of fragmentation if it sticks around.
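Here is a rough sketch of the idea for a platform where addresses grow upward. It is purely illustrative: the names are made up, and locking, the growth_direction bookkeeping and the Windows variant are all omitted.

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define CHUNKSIZE ((size_t)1 << 20)

static void *map1m(void *hint)
{
        void *p = mmap(hint, CHUNKSIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        return p == MAP_FAILED ? NULL : p;
}

/* Returns an aligned chunk, or an unaligned one the caller can still feed to
 * the over-allocating slow path. *retained/*retained_size describe the kept
 * leading fragment, which the caller must eventually munmap. */
static void *get_new_chunk_sketch(void **retained, size_t *retained_size)
{
        *retained = NULL;
        *retained_size = 0;

        void *p = map1m(NULL);
        if (p == NULL)
                return NULL;
        size_t offset = (uintptr_t)p & (CHUNKSIZE - 1);
        if (offset == 0)
                return p;  /* already aligned */

        /* Keep only the unaligned leading portion (addresses grow up, so the
         * aligned address is above us) and give back the rest... */
        size_t lead = CHUNKSIZE - offset;
        munmap((char *)p + lead, offset);
        *retained = p;
        *retained_size = lead;

        /* ...then ask again. If the free region starting at the aligned address
         * is big enough, we get exactly that aligned chunk; otherwise we get a
         * fresh region that still has a decent chance of being aligned. */
        return map1m(NULL);
}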
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #32)
> but I'm eager to see how this does on Try.

Looking good! https://tbpl.mozilla.org/?tree=Try&rev=613cf6345076
One thing I'm somewhat worried about: growth_direction probably won't be changed that frequently (we expect the fast path to succeed most of the time), but in theory I don't think there's anything stopping two threads from racing to change it. Now, it's not really a big deal if it's off by a few - as long as the direction is solidly defined, it'll have the same effect every time. I also don't know if we have any cross-platform atomic operations that work in C. But I thought it was worth bringing up, since this wasn't an issue in bug 1005849 (where it's a member variable).
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #35)
> I also don't know if we have any cross-platform atomic
> operations that work in C. But I thought it was worth bringing up, since
> this wasn't an issue in bug 1005849 (where it's a member variable).

We don't.  But if all you need is atomic operations on a single word, it's probably not difficult to reconstruct from the relevant platform primitives.

Raciness in the memory allocator seems like a recipe for headaches later on down the line.
Thanks, and agreed. I looked into this a bit, and it's not pretty:

Windows:
    #include <WinBase.h>
    InterlockedIncrement
    InterlockedDecrement
OSX:
    #include <libkern/OSAtomic.h>
    OSAtomicIncrement32
    OSAtomicDecrement32
Linux:
    __sync_fetch_and_add
    __sync_fetch_and_sub
    (GCC/Clang builtins, uncertain ARM support?)
Solaris:
    #include <atomic.h>
    atomic_inc_32
    atomic_dec_32
    (unsigned only)
FreeBSD:
    #include <sys/types.h>
    #include <machine/atomic.h>
    atomic_add_32
    atomic_subtract_32
    (no increment or decrement, unsigned only)

There's also C11 stdatomic.h, but it doesn't look like that's broadly available yet. The GCC/Clang builtins might just work everywhere but Windows, in which case it's not too bad.
Thanks for looking into this.  I think we can get away with supporting two separate cases:

- Windows
- GCC/Clang builtins

(These are essentially the same cases we support for mozilla/Atomics.h.)

The GCC/Clang builtins should work everywhere except possibly Solaris.  If somebody really wants to compile Firefox on Solaris with Sun's compiler, they can submit patches.  We haven't had any complaints about the lack of "proper" Solaris support in Atomics.h, though...
We actually don't even need to worry about Windows here, since growth_direction isn't even defined there (we know addresses grow up in Windows). So the only question remaining is whether to use the old (non-standard) __sync builtins or the new __c11_atomic (Clang) and __atomic (GCC 4.7+) builtins. We should be able to detect these at compile time.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #39)
> We actually don't even need to worry about Windows here, since
> growth_direction isn't even defined there (we know addresses grow up in
> Windows). So the only question remaining is whether to use the old
> (non-standard) __sync builtins or the new __c11_atomic (Clang) and __atomic
> (GCC 4.7+) builtins. We should be able to detect these at compile time.

I don't think it's worth trying to switch between the two based on a configure check.  And since we have to support GCC 4.4 (B2G builds) for now, we need to go with the __sync family of builtins.
(In reply to Nathan Froyd (:froydnj) from comment #40)
> I don't think it's worth trying to switch between the two based on a
> configure check.  And since we have to support GCC 4.4 (B2G builds) for now,
> we need to go with the __sync family of builtins.

That just makes my life easier :) This patch makes all accesses to growth_direction atomic. The __sync builtins only support the strictest (sequentially consistent) ordering, but performance shouldn't really be an issue here. I made sure it still builds everywhere, and it doesn't seem to cause any problems: https://tbpl.mozilla.org/?tree=Try&rev=822a786352a4
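For reference, the __sync-based accesses amount to something like this (a sketch with made-up macro names, not the attached patch; growth_direction isn't defined on Windows, so no Windows branch is needed):

#include <stdint.h>

#ifndef _WIN32
static int32_t growth_direction = 0;

/* The __sync builtins are full barriers; that's stronger than needed for a
 * rarely-updated hint, but performance isn't a concern on this path. */
#define GROWTH_DIR_LOAD() __sync_fetch_and_add(&growth_direction, 0)
#define GROWTH_DIR_UP()   __sync_fetch_and_add(&growth_direction, 1)
#define GROWTH_DIR_DOWN() __sync_fetch_and_sub(&growth_direction, 1)
#endif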
Attachment #8428196 - Attachment is obsolete: true
Attachment #8428196 - Flags: review?(mh+mozilla)
Attachment #8428358 - Flags: review?(mh+mozilla)
And some tests on OSX (a Unixy system where addresses grow up, exercising the growth_direction logic): https://tbpl.mozilla.org/?tree=Try&rev=a38651528f9d
Comment on attachment 8428358 [details] [diff] [review]
Part 2 v6: Add a 'last ditch' pass to find alignable chunks smaller than |size + alignment - pagesize|.

Review of attachment 8428358 [details] [diff] [review]:
-----------------------------------------------------------------

The code looks to me like it does what it advertises, but I'm not convinced it's worth the complexity.

For instance, the first attempt after the first non-aligned allocation, and before the slow path, on Unix, could just pages_map(initial-offset, offset) or pages_map(initial+size, offset) depending on the direction. If you're not raced by another thread, you're *very* likely to get the right block. If you don't, it's fine to fall back to the slow path. In fact, you're doing more mmap/munmap syscalls before the slow path than the slow path itself. I'm also not convinced the last ditch pass needs all that complexity.

::: memory/mozjemalloc/jemalloc.c
@@ +2739,5 @@
>          }
>  #endif
>  }
>  
> +#ifndef MOZ_MEMORY_WINDOWS

This whole thing should be ifndef JEMALLOC_USES_MAP_ALIGN. BTW, we should push for MAP_ALIGN support in the Linux kernel. It's in solaris, and it's freakingly useful.

@@ +2964,5 @@
>  
>          ret = pages_map(NULL, size, -1);
> +        if (ALIGNMENT_ADDR2OFFSET(ret, chunksize) == 0)
> +                return (ret);
> +

I think it would be less confusing with a kind-of-useless else.
Attachment #8428358 - Flags: review?(mh+mozilla) → feedback+
(In reply to Mike Hommey [:glandium] from comment #43)
> If you're not
> raced by another thread, you're *very* likely to get the right block. If you
> don't, it's fine to fallback to the slow path.

That will only work if the region extends far enough to be aligned - I can't speak to how likely that is, but if the first address you get is an unalignable chunk, the path will always fail - even if the next chunk up would have been aligned.

The problem with the slow path isn't so much that it's slow, but that it speeds up fragmentation by skipping past aligned chunks that were handed out previously and then freed. Of course even with the current patch, we might get an unalignable chunk followed by an unaligned chunk (or another unalignable chunk), and fall back to the over-allocating path regardless.

> In fact, you're doing more
> mmap/munmap syscalls before the slow path than the slow path itself.

We go from 2-3 system calls (mmap + munmap [+ munmap]) for the over-allocating path now, to 3 system calls (munmap + mmap + munmap) for the new path on success or 4 + 2-3 on failure (where failure means getting back an unaligned chunk).

In bug 1005849, Terrence and I decided that the calls are rare enough (a couple hundred during a run of Octane) that we might as well try hard to stave off fragmentation (within reason). Perhaps the tradeoff is different for jemalloc; I could do some measurements on how frequently chunk_alloc_mmap is called.
Oh, and note that the path you suggest (and which the original patch used) will also do two system calls on failure (if it gets back the wrong address), so the difference is 2 munmap calls (assuming the growth_direction logic is used in both cases).
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #45)
> Oh, and note that the path you suggest (and which the original patch used)
> will also do two system calls on failure (if it gets back the wrong
> address), so the difference is 2 munmap calls (assuming the growth_direction
> logic is used in both cases).

Except it's unlikely to fail. So most of the time it would just add an mmap, and that's all. Note I'm actually more concerned with the code path complexity than with the number of syscalls.
Flags: needinfo?(mh+mozilla)
Well, except on pathological cases.
Okay, I ripped out the new logic for the non-Windows path except for the growth_direction stuff (which is now just an optimization), getting rid of get_new_chunk_inner in the process (and renaming get_new_chunk to try_align_chunk). I left the new logic for the Windows path, since it's much simpler there; that's also the platform most of our address space constrained users are on, so I think it's probably worth having. Note that the second argument to pages_unmap is sometimes bogus now on Windows, but it isn't used there anyway. This patch hasn't received much testing yet.

(In reply to Mike Hommey [:glandium] from comment #43)
> > +#ifndef MOZ_MEMORY_WINDOWS
> 
> This whole thing should be ifndef JEMALLOC_USES_MAP_ALIGN. BTW, we should
> push for MAP_ALIGN support in the Linux kernel. It's in solaris, and it's
> freakingly useful.

Everything here is already in a !(defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE)) block, that hasn't changed with this patch. And yeah, it'd be great if the OS exposed a way to get aligned memory (as long as it doesn't suffer from the problems this bug was filed about).

> > +        if (ALIGNMENT_ADDR2OFFSET(ret, chunksize) == 0)
> > +                return (ret);
> > +
> 
> I think it would be less confusing with a kind-of-useless else.

Done.
Attachment #8428358 - Attachment is obsolete: true
Attachment #8429614 - Flags: review?(mh+mozilla)
Try pointed out that I forgot to return a value in some code I moved to the last ditch path (I'm surprised compiling locally didn't catch this); I have that fixed locally.
I think my eyes kind of glazed over while writing some of the logic in that patch, so let me try again; sorry about the churn.
Attachment #8429614 - Attachment is obsolete: true
Attachment #8429614 - Flags: review?(mh+mozilla)
Attachment #8429660 - Flags: review?(mh+mozilla)
Found another problem with the rewritten try_align_chunk (forgot to set a_retained_addr to NULL then checked it later). Kicked off a try run with that fixed: https://tbpl.mozilla.org/?tree=Try&rev=f6ccf8b071b2
Comment on attachment 8429660 [details] [diff] [review]
Part 2 v8: Add a 'last ditch' pass to find alignable chunks smaller than |size + alignment - pagesize|.

Review of attachment 8429660 [details] [diff] [review]:
-----------------------------------------------------------------

I need to take a look at this again with fresher eyes, but this looks good to me on a first pass. I'd still like Jason to take a look at this, though.

::: memory/mozjemalloc/jemalloc.c
@@ +2775,5 @@
> +         * just the pages we don't need. However, as we don't know a priori if addresses
> +         * are handed out in increasing or decreasing order, we have to try both
> +         * directions (depending on the environment, one will always fail).
> +         */
> +        bool addrs_grow_down = JEMALLOC_ATOMIC_LOAD(growth_direction) <= 0;

I'm tempted to say it doesn't really matter if this variable is not atomic. Worst that can happen is that if there's a race between two threads, one might take the wrong direction.
Attachment #8429660 - Flags: review?(mh+mozilla)
Attachment #8429660 - Flags: review?(jasone)
Attachment #8429660 - Flags: feedback+
> > +        bool addrs_grow_down = JEMALLOC_ATOMIC_LOAD(growth_direction) <= 0;
> 
> I'm tempted to say it doesn't really matter if this variable is not atomic.
> Worst that can happen is that if there's a race between two threads, one
> might take the wrong direction.

Yeah, there were a few more edge cases to consider previously, but now it's hard to see how this could go so wrong as to cause a problem. It's also just an optimization now - it won't cause increased fragmentation even if the first try always fails, it'll just cause two extra system calls on a path that probably isn't too hot to begin with. I can go either way on leaving it, making it non-atomic or taking it out altogether (though it doesn't really add much code).
I had a bit of a brain wave today; here's a new version that restores the 'second fast path' as it were when aligning the current chunk fails. A few changes in this version:

1) Makes growth_direction non-atomic again. It's just an optimization now, so getting it wrong is not the end of the world.
2) Bails out before trying the other direction if we're confident in the growth direction (abs(growth_direction) > 8).
3) Changes try_align_chunk back to get_new_chunk since it always returns a chunk (which may not be aligned).
4) Fixes a few bugs, including an inverted condition in the last ditch pass that probably completely broke it (doing a try run of *just* the last ditch pass now).

This brings the non-Windows behavior in line with the Windows behavior again, and is still a lot simpler than v7. If this goes in I'll probably align the GC allocator with it again.

I think this is the best of both worlds, but it does add another system call if the chunk cannot be aligned before trying the over-allocating path. This probably has a high chance of success - the way jemalloc and the GC allocator work, they're likely to leave behind aligned free chunks. But it's also just one more try at the fast path - if the chunk we get back isn't aligned, we'll still have to try the over-allocating path. It makes the last ditch pass nice and simple though.
Attachment #8429660 - Attachment is obsolete: true
Attachment #8429660 - Flags: review?(mh+mozilla)
Attachment #8429660 - Flags: review?(jasone)
Attachment #8433612 - Flags: review?(jasone)
Attachment #8433612 - Flags: feedback?(mh+mozilla)
Both pushes look green:

https://tbpl.mozilla.org/?tree=Try&rev=1ad043945c24 (normal)
https://tbpl.mozilla.org/?tree=Try&rev=fff0f7d6d024 (only use last ditch pass)

I included OSX because addresses grow up on that platform. Windows + Linux + OSX should exercise all the paths.
Any updates here? I am eager to see this fix go in.

I haven't seen any activity from jasone here, does he read bugmail?
I've been skimming the bug activity as it has played out, but I'm naturally more interested in jemalloc changes to the stand-alone version than the old version embedded in mozilla.  In general, I'm reluctant to increase complexity in this area of jemalloc because I got it subtly wrong in several distinct ways in the past.  Furthermore, if this patch is primarily motivated by virtual memory exhaustion on Linux-derived systems, I'm more inclined toward using the --disable-munmap feature in the stand-alone jemalloc, which I'm guessing has the desired side effect of more thoroughly exhausting virtual memory before hard allocation failure.
Note that the part of the code this patch applies to is 95% backported from jemalloc3 (if not more, I think only the function signatures differ, which has minor consequences on the code).
The primary motivator is address space exhaustion on Windows, since we currently only ship 32-bit builds there.
> Furthermore, if this patch is primarily motivated by virtual memory exhaustion on
> Linux-derived systems

It's motivated by virtual memory *fragmentation* (leading to premature exhaustion). I don't think --disable-munmap would help much with that. And AFAICT, --disable-munmap also doesn't clean up, so if you allocate, use, and free a large amount of memory, your RSS stays high. Sure, reallocations will reuse those memory ranges, but there might not be such reallocations (think of a browser session staying up for days and one particular page on the first day having used a lot more memory than anything else since then). Or are all non-munmapped pages guaranteed to have been madvised?
Yes, the 'recorded' (to later be 'recycled') chunks are left mapped, but their backing pages are 'purged', so the only overhead is kernel virtual memory data structure overhead.  On Linux this is one pointer per page; I have no idea what the overhead is on Windows.

I added --disable-munmap because of virtual memory fragmentation problems on Linux (its heuristics for finding unmapped virtual addresses leave holes around indefinitely).  It sounds to me like the failure mode on Windows is similar, so I'd expect the record/recycle workaround to work similarly.
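Conceptually, record/recycle amounts to keeping freed chunks mapped (but purged) and handing them back out before asking the OS again. A very simplified sketch, with made-up names and none of the extent-tree bookkeeping, locking, or coalescing of the real code:

#include <stddef.h>
#include <sys/mman.h>

#define MAX_RECORDED 64

static struct { void *addr; size_t size; } recorded[MAX_RECORDED];
static int nrecorded;

/* Instead of munmapping a freed chunk, purge its pages and remember the range. */
static void chunk_record_sketch(void *addr, size_t size)
{
        if (nrecorded == MAX_RECORDED) {
                munmap(addr, size);
                return;
        }
        madvise(addr, size, MADV_DONTNEED);  /* 'purge': drop the backing pages */
        recorded[nrecorded].addr = addr;
        recorded[nrecorded].size = size;
        nrecorded++;
}

/* Satisfy new requests from the recorded ranges before going back to the OS. */
static void *chunk_recycle_sketch(size_t size)
{
        int i;
        for (i = 0; i < nrecorded; i++) {
                if (recorded[i].size == size) {
                        void *addr = recorded[i].addr;
                        recorded[i] = recorded[--nrecorded];
                        return addr;
                }
        }
        return NULL;  /* nothing cached; fall back to mmap */
}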
(In reply to Jason Evans from comment #61)
> Yes, the 'recorded' (to later be 'recycled') chunks are left mapped, but
> their backing pages are 'purged'

Ah, indeed, pages_purge is one of the first calls in chunk_record.

So, Emanuel, David, how about we backport chunk_record/chunk_recycle and see where that gets us? We can revisit smarter alignment finding if that gets us nowhere or not far enough.

Jason: it seems to me there should be an upper cap, though, on the disable-munmap feature. I don't see much reason to keep a hold on e.g. 1GB of non-unmapped pages (especially since there are other things than jemalloc getting memory directly from the OS).
(In reply to Jason Evans from comment #57)
> I've been skimming the bug activity as it has played out, but I'm naturally
> more interested in jemalloc changes to the stand-alone version than the old
> version embedded in mozilla.

As far as I know the code we use on Linux/Mac/B2G/Android is still identical to jemalloc tip, and part 1 here brings Windows in line with that. I can post a version of the patch that applies to jemalloc tip instead if you'd prefer that context.

> In general, I'm reluctant to increase
> complexity in this area of jemalloc because I got it subtly wrong in several
> distinct ways in the past.

I understand your concern (this is the lowest level allocation path, after all!), but I'm fairly confident the changes here are correct. In particular, we have in-tree jsapi tests exercising the more complicated logic that landed in bug 1005849.

> Furthermore, if this patch is primarily
> motivated by virtual memory exhaustion on Linux-derived systems, I'm more
> inclined toward using the --disable-munmap feature in the stand-alone
> jemalloc, which I'm guessing has the desired side effect of more thoroughly
> exhausting virtual memory before hard allocation failure.

As Mike said, this is mostly intended to reduce memory fragmentation on 32-bit Windows. Other 32-bit platforms should also benefit, however.

(In reply to Jason Evans from comment #61)
> Yes, the 'recorded' (to later be 'recycled') chunks are left mapped, but
> their backing pages are 'purged', so the only overhead is kernel virtual
> memory data structure overhead.  On Linux this is one pointer per page; I
> have no idea what the overhead is on Windows.
> 
> I added --disable-munmap because of virtual memory fragmentation problems on
> Linux (its heuristics for finding unmapped virtual addresses leave holes
> around indefinitely).  It sounds to me like the failure mode on Windows is
> similar, so I'd expect the record/recycle workaround to work similarly.

This sounds like a good higher level solution. I don't think it makes the lower level logic entirely unnecessary (especially in low memory situations), but it should help to keep fragmentation from increasing.

(In reply to Mike Hommey [:glandium] from comment #62)
> So, Emanuel, David, how about we backport chunk_record/chunk_recycle and see
> where that gets us? We can revisit smarter alignment finding if that gets us
> nowhere or not far enough.

Okay, I'll work on this. I think the logic in attachment #8433612 [details] [diff] [review] is still useful for getting a reasonably contiguous set of aligned chunks in the first place, but caching the chunks we do have seems like a good idea.
Emanuel, are you still working on this?
Flags: needinfo?(emanuel.hoogeveen)
Sorry, I've been really busy with another bug that ended up taking a lot more work than expected. I'll work on this starting tomorrow.
Flags: needinfo?(emanuel.hoogeveen)
Attachment #8433612 - Attachment description: Part 2 v9: Add a 'last ditch' pass to find alignable chunks smaller than |size + alignment - pagesize|. → (Old) Part 2 v9: Add a 'last ditch' pass to find alignable chunks smaller than |size + alignment - pagesize|.
Attachment #8433612 - Flags: review?(jasone)
Attachment #8433612 - Flags: feedback?(mh+mozilla)
Attached patch Part 1: Remove support for the unused MALLOC_PAGEFILE. (obsolete) (deleted)
This removes support for the unused MALLOC_PAGEFILE, which was adding clutter to a bunch of places.
Attachment #8459223 - Flags: review?(mh+mozilla)
This removes the old allocation logic, moving Windows and the JEMALLOC_USES_MAP_ALIGN paths over to the jemalloc3 logic.
Attachment #8419109 - Attachment is obsolete: true
Attachment #8459224 - Flags: review?(mh+mozilla)
This makes all the big allocation paths (huge_palloc, base_alloc) go through chunk_alloc instead of calling pages_map manually.
Attachment #8459225 - Flags: review?(mh+mozilla)
Attached patch Part 4: Import chunk recycle code from jemalloc3. (obsolete) (deleted)
This part actually adds the new functionality. It seems to work - I was able to finish a run of Membench without any problems, and memory usage afterward looked reasonable (possibly better, but it's hard to say without looking at the variance). A few caveats:

1) Enabling the chunk recycling disables all unmapping - we hold on to the address space forever. That might not be so great in the presence of non-jemalloc allocators in low-memory situations, though jemalloc obviously isn't going to get in its own way. Capping the number of chunks that get saved shouldn't be too hard, but I haven't looked into that yet.

2) I disabled the address space coalescing on Windows. This might actually work for the most part, since we never unmap anything in this mode, but MALLOC_DECOMMIT (enabled on Windows) breaks that model. Trying to commit or decommit memory that spans multiple regions returned by VirtualAlloc doesn't really work - I'm not sure of the exact details, but I saw crashes until I disabled the coalescing.

Perhaps we can work around this by *only* caching 1MiB chunks (e.g. partially revert part 3 and free base and huge allocations in a way that bypasses chunk_record()). I'd rather not deviate too much from the code we're importing, but I don't think we can just get rid of MALLOC_DECOMMIT either (patching that in might be a prerequisite to moving to jemalloc3, as well).

3) This still isn't going to help us allocate unaligned chunks, so I don't think it'll fix this bug. It *will* help if deallocating causes the OS allocator to leave holes in the VM, as Jasone described, but I don't know if that's even a problem on Windows.

Just asking for feedback on this part since I'm not really sure what the next step should be.
Attachment #8459241 - Flags: feedback?(mh+mozilla)
Here's a try run on all four parts since I wanted to see how it does in practice: https://tbpl.mozilla.org/?tree=Try&rev=6f8d93df2865
One thing I forgot to mention about part 3 is that it will make 'base' allocations chunk-aligned where they weren't before. This matches jemalloc3 and *might* help this bug, though I don't know how common these were.
Comment on attachment 8459223 [details] [diff] [review]
Part 1: Remove support for the unused MALLOC_PAGEFILE.

Review of attachment 8459223 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry it took so long. This somehow got out of my radar.
Attachment #8459223 - Flags: review?(mh+mozilla) → review+
Comment on attachment 8459224 [details] [diff] [review]
Part 2: Always use the jemalloc3 allocation logic and remove the old logic.

Review of attachment 8459224 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/mozjemalloc/jemalloc.c
@@ +2648,5 @@
>  static void *
>  chunk_alloc_mmap(size_t size)
>  {
> +#ifdef JEMALLOC_USES_MAP_ALIGN
> +        return pages_map_align(size, chunksize);

We may want to upstream this part.
Attachment #8459224 - Flags: review?(mh+mozilla) → review+
Attachment #8459225 - Flags: review?(mh+mozilla) → review+
Comment on attachment 8459241 [details] [diff] [review]
Part 4: Import chunk recycle code from jemalloc3.

Review of attachment 8459241 [details] [diff] [review]:
-----------------------------------------------------------------

To make things on par with jemalloc3, you need to add --disable-munmap to the jemalloc3 configure arguments list in configure.in (search for "Run jemalloc configure script").

::: memory/mozjemalloc/jemalloc.c
@@ +2854,5 @@
> +	bool unzeroed;
> +	extent_node_t *xnode, *node, *prev, *xprev, key;
> +
> +#ifdef MALLOC_DECOMMIT
> +	pages_commit(chunk, size);

Committing before purging seems odd.

@@ +2872,5 @@
> +
> +	malloc_mutex_lock(&chunks_mtx);
> +	key.addr = (void *)((uintptr_t)chunk + size);
> +	node = extent_tree_ad_nsearch(chunks_ad, &key);
> +#ifndef MOZ_MEMORY_WINDOWS

Without coalescing, a 2*chunksize allocation won't find two adjacent chunksize blocks that would have been "freed" previously. So I think it's wrong to exclude this for Windows. That being said, I can see a problem here whether we decommit or not, and it seems to me there's a bug to fix in jemalloc3. I think the simpler way to address this is to never actually VirtualAlloc entire blocks but rather allocate multiple adjacent chunksize blocks.

(BTW, I don't remember why I ruled out porting decommit to jemalloc3, but with this recycling code, it might be worth doing)

As for the constant eating of address space, maybe we should add a cap to the amount of memory that can be chunk_record'ed. Jason probably has something to say about this.
Attachment #8459241 - Flags: feedback?(mh+mozilla) → feedback+
Flags: needinfo?(jasone)
(In reply to Mike Hommey [:glandium] from comment #74)
> > +#ifdef MALLOC_DECOMMIT
> > +	pages_commit(chunk, size);
> 
> Committing before purging seems odd.

Well, calling VirtualAlloc with MEM_RESET actually leaves the pages committed. I'm not sure what happens if you call MEM_RESET on decommitted pages; if that Just Works, we have a few options:

1) Commit everything here before resetting the pages, as in the patch, so that when we start using them the pages are already committed.
2) Commit everything when we start using the pages again.
3) Decommit everything instead of resetting it. I'm not sure this has any advantage over resetting, or if it makes accessing the reserved pages again slower.

Note that committing committed pages is a no-op, and so is decommitting decommitted pages. Calling MEM_RESET on pages that aren't committed might fail though.
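Assuming the Windows VM API behaves as described above, the three options correspond roughly to these calls (illustrative only, not the patch; MEM_RESET keeps the pages committed but tells the OS their contents can be discarded, while MEM_DECOMMIT releases the backing store but keeps the address range reserved):

#include <windows.h>

static void reset_pages(void *addr, size_t size)
{
        /* Pages must already be committed; flProtect is ignored for MEM_RESET but
         * must still be a valid protection value. */
        VirtualAlloc(addr, size, MEM_RESET, PAGE_READWRITE);
}

static void decommit_pages(void *addr, size_t size)
{
        VirtualFree(addr, size, MEM_DECOMMIT);
}

static void recommit_pages(void *addr, size_t size)
{
        VirtualAlloc(addr, size, MEM_COMMIT, PAGE_READWRITE);
}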

> I think the simpler way to address this is to never actually
> VirtualAlloc entire blocks but rather allocate multiple adjacent chunksize
> blocks.

> As for the constant eating of address space, maybe we should add a cap to
> the amount of memory that can be chunk_record'ed. Jason probably has
> something to say about this.

Hmm, to do this we'd have to VirtualAlloc the entire block, then free it and VirtualAlloc the address range in chunks (which could fail due to threading). Otherwise we could reach the chunk caching cap and try to free a chunksized region that's actually part of a bigger block. Only huge allocations would have to do this, though, so it might not be that bad (although they would require more nodes for the accounting).

> (BTW, I don't remember why I ruled out porting decommit to jemalloc3, but
> with this recycling code, it might be worth doing)

Yeah, I was thinking of doing this for bug 801536. It should be pretty much a direct port except for the ifdefs.
Sorry for sitting on this for so long, I've been busy with course work and I wasn't happy with my solutions for the problems with part 4.

I had a bit of a brain wave on how to simplify things today, and I'm happy with the way things turned out. I decided to move the patches to bug 1073662 since I'm still not wholly convinced they will fix this bug, and I don't want to split landings across different trains. If crash stats show that part 4 in that bug fixes this bug, we can close it.
Depends on: 1073662
Attachment #8459223 - Attachment is obsolete: true
Attachment #8459224 - Attachment is obsolete: true
Attachment #8459225 - Attachment is obsolete: true
Attachment #8459241 - Attachment is obsolete: true
With the patches in bug 1073662 landed, does anything remain to be done here?
I'm not sure those solve the problem here, of chunk allocation missing aligned 1 MiB holes because the first 1 MiB chunk happens to be unaligned. The caching might help, and part 3 in that bug makes 'base' allocations chunk-aligned, but the underlying allocation logic is the same.

I think we should wait until we have numbers on the amount of 1 MiB holes present in OOM crashes after the patches in that bug, to see if more needs to be done here.
> I think we should wait until we have numbers on the amount of 1 MiB holes
> present in OOM crashes after the patches in that bug, to see if more needs
> to be done here.

I'll collect these numbers. We should have enough data in 2-3 days.
Details in bug 1001760 comment 9. The short version is: there are still some moderate gains that can be squeezed out of this bug. If we can easily add a 'last ditch' attempt like the JS fix, I think it would be worthwhile.
This is basically the old patch, rebased on top of bug 1073662 and bug 1100485. I changed MAX_LAST_DITCH_ATTEMPTS to 32 and added a comment justifying the number; see also the discussion in bug 1001760. It's my opinion that the cost of trying a few more times is low (especially compared to an OOM crash), and with the new chunk cache we shouldn't hammer on the OS allocator as much as we used to (unless we're continually allocating, in which case we'll OOM anyway).
Attachment #8433612 - Attachment is obsolete: true
Attachment #8525750 - Flags: review?(mh+mozilla)
Hmm, I didn't rebase hard enough - various places in the patch were still calling the old 3-parameter version of pages_map(), but for some reason MSVC doesn't give a damn. That's pretty disconcerting, but at least our other compilers don't seem to be so sloppy. I'll update the patch after I get a green run on try (it's otherwise unchanged).
> Hmm, I didn't rebase hard enough - various places in the patch were still
> calling the old 3-parameter version of pages_map(), but for some reason MSVC
> doesn't give a damn.

Wow, apparently it is only a warning in .c files:
warning C4020: 'pages_map' : too many actual parameters
glandium: review ping?
I'm putting this on hold until I figure things out about bug 1073662 and until we switch to jemalloc 3 (which we're planning for soon).
FWIW, a port of this code is now in production for the GC's allocator and being tested by a jsapi-test, see bug 1118481. The GC's allocator was previously using an older version of the logic with a more complicated path on Linux and OSX, so this was a low risk simplification.
(In reply to Mike Hommey [:glandium] from comment #86)
> I'm putting this on hold until I figure things out about bug 1073662 and
> until we switch to jemalloc 3 (which we're planning for soon).

Given that jemalloc3 is in an indefinite waiting state, perhaps we should just do this. glandium, what do you think?
Flags: needinfo?(mh+mozilla)
glandium says that jemalloc3 is now close to being turned on, as in, hopefully it can happen in the next week or two. With that in mind, it's probably worth holding out here for just a little longer...
(In reply to Nicholas Nethercote [:njn] from comment #89)
> glandium says that jemalloc3 is now close to being turned on, as in,
> hopefully it can happen in the next week or two. With that in mind, it's
> probably worth holding out here for just a little longer...

It seems that this did not happen. Is there an updated ETA for jemalloc, and do you still intend to wait for it? Thanks.
Flags: needinfo?(n.nethercote)
I talked to glandium about this yesterday. It's very likely we'll end up giving up on jemalloc4, in which case this bug will become relevant again. glandium should have results on his final last-gasp experiments with jemalloc4 this week. Thank you for your patience :)
Flags: needinfo?(n.nethercote)
I don't know how much time I'll be able to dedicate to this bug, as I need to focus on my studies. So if significant additional work is needed, someone else may need to take over.

Something I've been meaning to fix, but which I've been putting off due to the uncertain situation of the current jemalloc implementation, is cache eviction. The current implementation has none, which probably makes it less than optimal in low memory situations, especially on Windows where we only reuse chunksized regions. I see two potential issues (assume a user requests an N MB region of address space):

(1) No such region is available in the cache, nor outside it. However, by combining the cached and free address space, a contiguous region of sufficient size could be found. This affects all platforms. Perhaps this could do something clever, like evicting the biggest contiguous regions first, though it might not be worth the effort over just evicting everything.

(2) (Windows only) A region of sufficient size is available in the cache, but we bypass it because we only reuse chunksized regions on Windows. This could be solved by checking the cache for larger regions (or contiguous runs of cached chunks) on allocation failure, then deallocating the chunksized regions and reallocating; a rough sketch follows at the end of this comment.

If we end up evicting everything to solve (1), that does also mean we lose access to our nice list of aligned 1MB regions. The more complicated allocation code in this bug would mitigate that and help rebuild our cache.

I don't know if I'll be able to dedicate time toward solving these issues though. I just wanted to have them on file in a more high profile bug :)
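Here's a rough sketch of the fallback in (2), using a stand-in std::map cache and reserve-only mappings for brevity (this is not mozjemalloc's actual cache structure, just the shape of the idea):

#include <windows.h>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

// Cached, currently unused chunk-sized reservations keyed by start address.
static std::map<void*, size_t> gCache;

static void* MapWithCacheEviction(size_t size) {
  if (void* p = VirtualAlloc(nullptr, size, MEM_RESERVE, PAGE_NOACCESS)) {
    return p;
  }
  // Look for a contiguous run of cached regions covering at least size bytes.
  void* runStart = nullptr;
  size_t runSize = 0;
  for (auto it = gCache.begin(); it != gCache.end(); ++it) {
    if (!runStart || static_cast<uint8_t*>(runStart) + runSize != it->first) {
      runStart = it->first;  // Not contiguous with the previous entry.
      runSize = 0;
    }
    runSize += it->second;
    if (runSize >= size) {
      // Hand the whole run back to the OS, then retry. Each cached entry is
      // its own reservation on Windows, so release them one by one.
      auto last = std::next(it);
      for (auto e = gCache.find(runStart); e != last; ++e) {
        VirtualFree(e->first, 0, MEM_RELEASE);
      }
      gCache.erase(gCache.find(runStart), last);
      return VirtualAlloc(nullptr, size, MEM_RESERVE, PAGE_NOACCESS);
    }
  }
  return nullptr;  // Still nothing; caller falls back to OOM handling.
}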
(In reply to Nicholas Nethercote [:njn] from comment #91)
> I talked to glandium about this yesterday. It's very likely we'll end up
> giving up on jemalloc4, in which case this bug will become relevant again.
> glandium should have results on his final last-gasp experiments with
> jemalloc4 this week. Thank you for your patience :)

Those experiments turned out to be successful and jemalloc4 was just enabled on Nightly, though it's not yet riding the trains. So that puts this bug back into a holding pattern, I guess...
No, we should try to implement this. And the great thing is that with the chunk hooks, we can now do so without touching the jemalloc code.
Flags: needinfo?(mh+mozilla)
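For example, with jemalloc4's chunk hooks the custom allocation logic could live entirely outside the jemalloc sources, roughly like this (assuming an unprefixed jemalloc 4 build and the chunk_hooks_t API as documented there; my_chunk_alloc is only a placeholder that delegates to the default hook, which is where a last-ditch pass would go):

#include <jemalloc/jemalloc.h>
#include <cstddef>

static chunk_hooks_t gOldHooks;  // The default hooks, kept so we can delegate.

static void* my_chunk_alloc(void* chunk, size_t size, size_t alignment,
                            bool* zero, bool* commit, unsigned arena_ind) {
  // Delegate to the default allocation; on failure this is where the
  // aligned-search logic from the patch in this bug would be tried.
  return gOldHooks.alloc(chunk, size, alignment, zero, commit, arena_ind);
}

// Install the custom hook for arena 0; real code would cover every arena.
static bool install_chunk_hooks() {
  size_t len = sizeof(gOldHooks);
  if (mallctl("arena.0.chunk_hooks", &gOldHooks, &len, nullptr, 0) != 0) {
    return false;
  }
  chunk_hooks_t newHooks = gOldHooks;
  newHooks.alloc = my_chunk_alloc;
  return mallctl("arena.0.chunk_hooks", nullptr, nullptr, &newHooks,
                 sizeof(newHooks)) == 0;
}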
Just went through the comments. I guess the first step to move this forward is to re-enable jemalloc4, i.e., look into bug 1219914, and then implement the patch (the similar 'slide' logic) through the chunk hooks?
Flags: needinfo?(mh+mozilla)
(In reply to Ting-Yu Chou [:ting] from comment #95)
> Just went through the comments. I guess the first step to move this forward
> is to re-enable jemalloc4, i.e., look into bug 1219914, then implement the
> patch (the slide similar logic) through chunk hooks?

Enabling jemalloc4 causes Talos regressions, see bug 1219914 comment 23. I guess it won't be enabled soon.
Flags: needinfo?(mh+mozilla)
Are there any concerns or anything else I should do to get the patch reviewed? Thank you.
Flags: needinfo?(mh+mozilla)
Comment on attachment 8525784 [details] [diff] [review]
v2 - Rework chunk allocation to align misaligned chunks and add a 'last ditch' pass to find any remaining usable chunks.

Could you review this? If not, what else should I do to move this forward? Thanks.
Attachment #8525784 - Flags: review?(n.nethercote)
Comment on attachment 8525784 [details] [diff] [review]
v2 - Rework chunk allocation to align misaligned chunks and add a 'last ditch' pass to find any remaining usable chunks.

Review of attachment 8525784 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, I don't know mozjemalloc well enough to review this. glandium is the right person for this. The prospect of jemalloc4 being used was blocking him from reviewing, but given that is not going to happen for some time, maybe it's the right time to do this...
Attachment #8525784 - Flags: review?(n.nethercote)
If there's no concrete timetable for the switch to jemalloc4, I can do whatever is needed to get this in. Note that almost identical code has been used for years now by SpiderMonkey's GC; I think the only controversial part is the hand-written atomic access (unfortunately Visual Studio *still* doesn't support C11 atomics as far as I can tell, and I'm not sure about GCC).
Flags: needinfo?(mh+mozilla)
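For reference, the hand-written atomic access in question is roughly of this shape (a sketch, not the patch code; with mozjemalloc compiled as C++ this collapses to std::atomic<size_t> and fetch_add):

#include <cstddef>

#ifdef _MSC_VER
#  include <intrin.h>
static size_t AtomicIncrement(volatile size_t* p) {
#  ifdef _WIN64
  return static_cast<size_t>(
      _InterlockedIncrement64(reinterpret_cast<volatile __int64*>(p)));
#  else
  return static_cast<size_t>(
      _InterlockedIncrement(reinterpret_cast<volatile long*>(p)));
#  endif
}
#else
// GCC/Clang builtin; __atomic_add_fetch would work equally well.
static size_t AtomicIncrement(volatile size_t* p) {
  return __sync_add_and_fetch(p, 1);
}
#endif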
Emanuel, want to take another stab at this given that we're no longer blocking on jemalloc 4?
Flags: needinfo?(emanuel.hoogeveen)
Dropping to P2; with 64-bit around the corner this will be less of an issue, but it would still be good to get a fix.
Whiteboard: [MemShrink:P1] → [MemShrink:P2]
Sorry about the delay; I've been in the middle of a big move, so things have been kind of crazy. Yes, this would be good to do; the hairiest part is still the atomic access - that wouldn't be an issue if we could move mozjemalloc to C++. The old patch here probably still applies, though; I can have a look at it in a few days.
Flags: needinfo?(emanuel.hoogeveen)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #103)
> that wouldn't be an issue if we could move mozjemalloc to C++.

Good news! mozjemalloc is now compiled as C++. This was bug 1365194. :)

Emanuel, do you still want this to be assigned to you?

Flags: needinfo?(emanuel.hoogeveen)

Clear a needinfo that is pending on an inactive user.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jasone)

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: emanuel.hoogeveen → nobody
Status: ASSIGNED → NEW

Clear a needinfo that is pending on an inactive user.

Inactive users most likely will not respond; if the missing information is essential and cannot be collected another way, the bug maybe should be closed as INCOMPLETE.

For more information, please visit auto_nag documentation.

Flags: needinfo?(emanuel.hoogeveen)
Severity: normal → S3