Closed Bug 1479960 Opened 6 years ago Closed 5 years ago

Add support for freezing shared memory

Categories

(Core :: IPC, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla70
Fission Milestone M4
Tracking Status
firefox70 --- fixed

People

(Reporter: jld, Assigned: jld)

References

(Blocks 1 open bug)

Details

Attachments

(10 files)

(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
MemMapSnapshot kind of reimplements shared memory in order to have the Freeze method: dropping write permissions in a way that can't be undone later. Also it's using file_util, which I'd like to remove, and more to the point it's using unlinked files on OSes where there are better alternatives that we already support (shm_open on Mac: bug 1447867) or will (memfd_create: bug 1440203). So I want to implement this properly in SharedMemoryBasic. Methods of dropping write access vary: * Windows: DuplicateHandle can create a new handle without write access * shm_open (POSIX): open the segment twice, as writable and readable, before unlinking * ashmem (Android): the ASHMEM_SET_PROT_MASK ioctl will remove (but not add[1]) permissions from the segment (not just the fd[2]) * memfd_create (Linux 3.17): MFD_ALLOW_SEALING + F_SEAL_WRITE revokes write permission from the file (not just the fd, and it fails if writable shared mappings exist) Note that we don't have a portable way for a process that receives an untrusted shmem to prevent further modifications by the process that sent it, other than just copying out the data. In principle this feature could also be exposed in ipc::Shmem (with a dynamic check that it's in the parent process and hasn't ever been shared, and probably a ReadOnlyShmem type that's copyable or threadsafe-refcounted or something), but I don't know if there'd be enough consumers to make it worthwhile (vs. just using SharedMemoryBasic directly). [1] https://android.googlesource.com/platform/system/core/+/4c87c1743efecaf992c7d0340041458f78fc9294/libcutils/tests/AshmemTest.cpp#207 [2] https://android.googlesource.com/platform/system/core/+/4c87c1743efecaf992c7d0340041458f78fc9294/libcutils/tests/AshmemTest.cpp#231
We need more than support for freezing shared memory. We also need to be able to share a single memory region between multiple processes, which Shmem does not currently support. It can only be shared between two processes, and generally only one is allowed to access it at a time. I agree that ideally this should be handled by Shmem, or something similar, though.
…and by SharedMemoryBasic I mean base::SharedMemory. See also bug 1454819.
(In reply to Kris Maglione [:kmag] from comment #1) > We need more than support for freezing shared memory. We also need to be > able to share a single memory region between multiple processes, which Shmem > does not currently support. It can only be shared between two processes, and > generally only one is allowed to access it at a time. > > I agree that ideally this should be handled by Shmem, or something similar, > though. Right (and good point about Shmem also being bound to a single channel); for this bug I was just going to add support in base::SharedMemory and not try to do anything fancy with higher-level interfaces or IPDL.
In light of https://bugs.chromium.org/p/project-zero/issues/detail?id=1671 (Mac POSIX shm was always mappable read/write, even from a read-only fd), an additional requirement is a unit test to make sure freezing actually works.
I think I finally have this working on all of our platforms. Both MacOS and Android need the workaround at [1], originally for a now 20-year-old version of OSF/1, because: * MacOS (using shm_open) rejects all mappings with MAP_PRIVATE, even when it's read-only and there's no semantic difference * Android (using ashmem) seems to quietly map zeroed memory instead of the shared memory's contents In particular, all zeroes represents the empty SharedStringMap, so that one caused mysterious failures from missing pref settings (especially the one for ForcePermissiveCOWs) and probably also locale strings (error messages?). I don't yet know the root cause of the Android bug; ashmem's mmap method looks superficially reasonable but I don't know Linux VM internals very well. Also, both MacOS and Android need to skip the fstat()ing that AutoMemMap currently does[2], because we already have the exact size (give or take some cleanup in base::SharedMemory), and: * Android: ashmem seems to just not implement fstat; instead there's an ASHMEM_GET_SIZE ioctl * MacOS: the fstat is checked against the sandbox policy; we could allow it, but we don't need to, so let's not And comment #4 is not quite correct: with that bug, the mmap was still required to be read-only, but later calls to mprotect weren't checked, so just trying construct a writeable SharedMemory with the fd wouldn't work. However, I should still write some unit tests. [1] https://searchfox.org/mozilla-central/rev/ecf61f8f3904549f5d65a8a511dbd7ea4fd1a51d/nsprpub/pr/src/md/unix/unix.c#3572 [2] https://searchfox.org/mozilla-central/rev/ecf61f8f3904549f5d65a8a511dbd7ea4fd1a51d/js/xpconnect/loader/AutoMemMap.cpp#66

Just to be clear, the security bugs mentioned above do not affect our current read-only shared memory implementations, only the, er, more elegant solutions Jed intends to use for the IPC-owned helper. So it shouldn't block any of the shared memory work that we currently have in-flight which does not depend on that.

Blocks: 1533569
No longer blocks: 1474793

To clarify, when this bug speaks of "freezing" a shared memory region, does that mean the region will become permanently read-only for all users, or will it be possible to enforce a region being read-only for content processes, while remaining writable for the chrome process that originally created it?

(In reply to Jonathan Kew (:jfkthame) from comment #7)

To clarify, when this bug speaks of "freezing" a shared memory region, does that mean the region will become permanently read-only for all users, or will it be possible to enforce a region being read-only for content processes, while remaining writable for the chrome process that originally created it?

Read-only for all users. One-sided read-only isn't possible with the ashmem and memfd backends. So, you'd need to create and send a new snapshot, or send incremental updates in separate memory segments.

This sounds like a potential problem for the shared-fontlist code, which depends on being able to incrementally update the shared data that the content processes are reading. The current patches handle this by allocating a large (256K) shared memory segment to use as an allocation arena, and then handing out smaller chunks of this for the various records that need to be stored and shared. Eventually (depending on the number of fonts installed and used) additional 256K segments will be allocated to extend the arena as needed.

In the course of a session, we're likely to do several hundred separate updates to the font list, as and when we load data for additional font families and individual faces. So if we were to use a new segment for each update, what sort of overheads are there per shared-memory segment? How many separate segments can we reasonably allocate, before running into performance degradation or hard limits -- dozens? hundreds? thousands?

Flags: needinfo?(jld)

(In reply to Jonathan Kew (:jfkthame) from comment #9)

In the course of a session, we're likely to do several hundred separate updates to the font list, as and when we load data for additional font families and individual faces. So if we were to use a new segment for each update, what sort of overheads are there per shared-memory segment? How many separate segments can we reasonably allocate, before running into performance degradation or hard limits -- dozens? hundreds? thousands?

Thousands. There shouldn't be much additional overhead for shared memory mappings over non-shared mappings.

In principal, the main limiting factor should be the 64KB kernel VMA limit on Linux, which applies equally to shared mappings and non-shared mappings. In practice, the fact that some Linux environments place a hard limit on the number of file descriptors a process is allowed to have open is more likely to be a problem, since the parent process will need to keep a file descriptor open for each segment it needs to be able to share with new file processes.

The best option is probably to occasionally coalesce segments and broadcast them to the child processes to remap in place of their existing segments. The SharedMap code that we use for shared JS data does something similar, except that it skips incremental updates entirely and just rebuilds the entire snapshot with each update (while trying to coalesce as many updates as possible into a single rebuild, and defer the rebuild until an idle slice).

I don't know if the shared font code would be able to get away with that, but I'm pretty sure it should be able to get away with a hybrid incremental/full rebuild approach.

How many separate segments can we reasonably allocate, before running into performance degradation or hard limits -- dozens? hundreds? thousands?

Thousands. There shouldn't be much additional overhead for shared memory mappings over non-shared mappings.

I guess the other thing I should've asked is what is the typical cost (time-wise) of allocating a new shared memory segment? I'm imagining it is probably a lot more expensive than simply grabbing the next N bytes of an existing region.

The best option is probably to occasionally coalesce segments and broadcast them to the child processes to remap in place of their existing segments.

I'm a bit wary of doing that, as I think it would force all the processes to entirely flush their font and shaped-text caches, and reflow everything, which we really don't want to be triggering if we can help it. With the current patches, the only time this happens is if the actual list of available fonts changes during the browser session -- e.g. the user installs a new font while Firefox is running -- which is (a) a rare event, and (b) likely to be at a time when Firefox performance isn't in focus, because the user is busy managing fonts rather than loading web pages.

I'll do some instrumentation to determine how many individual allocations are currently happening in a few typical runs, though I suspect the number will be rather large.

(In reply to Jonathan Kew (:jfkthame) from comment #11)

I guess the other thing I should've asked is what is the typical cost (time-wise) of allocating a new shared memory segment? I'm imagining it is probably a lot more expensive than simply grabbing the next N bytes of an existing region.

That's a good question. It might be a good idea to profile, but as a rough idea:

  • On Windows, it should be pretty cheap, even as things stand now. The shared memory APIs are pretty lightweight.
  • On Linux, when we're using /dev/shm it's relatively expensive. Each allocation requires several path-based filesystem operations, each of which also happens to require a system call. Once we switch to memfd_create and ashmem, it should be comparable to Windows.

Regardless, it would obviously be more expensive than just adding onto the end of an existing allocation, but probably not much more than a malloc().

The real cost in either case, though, would probably be broadcasting the descriptors to the content processes.

The best option is probably to occasionally coalesce segments and broadcast them to the child processes to remap in place of their existing segments.

I'm a bit wary of doing that, as I think it would force all the processes to entirely flush their font and shaped-text caches, and reflow everything, which we really don't want to be triggering if we can help it. With the current patches, the only time this happens is if the actual list of available fonts changes during the browser session -- e.g. the user installs a new font while Firefox is running -- which is (a) a rare event, and (b) likely to be at a time when Firefox performance isn't in focus, because the user is busy managing fonts rather than loading web pages.

Hm. I was assuming we would be able to do this with little disruption as long as data doesn't change, but I don't know how references to the shared memory data are stored in the current regime.

If it's that difficult to update it without rebuilding everything, it would probably be possible to remap the coalesced data to the same addresses, and with the same layout as the piecemeal data. That would leave us some wasted padding, and we'd need some sort of fallback to handle another thread mapping something else in our target address space in the middle of the update, but it would probably still be better than not coalescing the mappings.

We might have a solution for one-sided write access; it should work with memfd but it's not yet known if it works with ashmem.

On Linux, a process can open /proc/self/fd/N to get a new file descriptor for the object referenced by its existing fd N, with a new set of flags and a new check of its credentials against the object's permissions. This came up before in some context previously (but apparently not the comments of this bug, yet), because a process with access to /proc/self/fd could regain write permissions on a “frozen” fd to an unlinked /dev/shm file, for example. (The sandbox policies don't allow that, so we're safe there.)

But it can also drop permissions, so it's possible to get an O_RDONLY fd to a memfd that way, and it's not possible to create PROT_WRITE MAP_SHARED mappings with that fd (or write it with write).

(I can't take the credit for this discovery; it was pointed out by someone Alex Gaynor asked about this problem, whereas I'd forgotten about that feature of procfs magic symlinks.)

What I'm not sure about yet is Android: the same trick should work to get a read-only ashmem fd, but I don't trust that ashmem will follow the normal Unix file behavior without testing. (We also don't have sandboxing on Android yet, but there are plans for it, and I don't want it to become impossible because of what we do here.)

But also, newer Androids should have memfd: definitely Oreo (when Google started enforcing kernel versions) and potentially back to Marshmallow depending on the device.

Flags: needinfo?(jld)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #13)

We might have a solution for one-sided write access; it should work with memfd but it's not yet known if it works with ashmem.

On Linux, a process can open /proc/self/fd/N to get a new file descriptor for the object referenced by its existing fd N, with a new set of flags and a new check of its credentials against the object's permissions. This came up before in some context previously (but apparently not the comments of this bug, yet), because a process with access to /proc/self/fd could regain write permissions on a “frozen” fd to an unlinked /dev/shm file, for example. (The sandbox policies don't allow that, so we're safe there.)

The problem with this is that we don't currently require procfs, and certain users (and Tor Browser) want to run without it.

(In reply to Kris Maglione [:kmag] from comment #14)

The problem with this is that we don't currently require procfs, and certain users (and Tor Browser) want to run without it.

We kind of do require procfs on kernels before 3.17: it's used by the sandbox and we effectively release-assert that it's present, unless sandboxing is completely turned off. (And this doesn't seem to have come up on the Tor bug yet; maybe nobody interested in the full-browser sandbox is using an older LTS distro?) But, yes. We'd have to detect that case and fall back from memfd to POSIX shm, and hopefully that won't be a problem on Android.

It doesn't work on Android, but not the way I thought it wouldn't work. A read/write mapping of a read-only ashmem fd fails with EACCES, as expected. However, a read-only mapping of a read-only fd also fails, with EINVAL.

The problem seems to be using procfs to reopen the inode: doing that with O_RDWR instead of O_RDONLY returns a fd that fails mmap with EINVAL both for read-only and read/write.

So it looks like the only way we could get memory that one process can write and another can only read (including if malicious but adequately sandboxed) on Android with kernel < 3.17 is to create a temporary file on an actual storage-backed filesystem (i.e., not tmpfs), which is not ideal. I just did some experiments with memory-mapped ext4 files (vs. memfd) and checking the clean/dirty stats in /proc/N/smaps, and Linux appears to write changes back to secondary storage even if the file was unlinked.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #4)

In light of https://bugs.chromium.org/p/project-zero/issues/detail?id=1671
(Mac POSIX shm was always mappable read/write, even from a read-only fd), an
additional requirement is a unit test to make sure freezing actually works.

I did this, and there's a problem: CVE-2018-4435 is fixed only in MacOS 10.12 and up (because older versions are no longer supported by Apple), but we still support back to 10.9, and (helpfully, in this case) our CI runs on 10.10

So we're going to need to continue using temporary files for freezable shared memory on older MacOS. (Non-freezeable can still use shm_open.) I have an implementation and it passes simple unit tests.

FreeBSD's SHM_ANON is useful for the usual case of shared memory, but it
doesn't support freezing. It could be re-added later, but for now it's
simplest to remove it (and use named shm instead) while refactoring.

Depends on D26738

Despite the comment saying not to use the "handle" except as an opaque
identifier, it is being used to pass the handle to other OS APIs. Direct
access to the handle needs to be controlled to make sure freezing is
safe, so this patch replaces that with interfaces that are more explicit
about ownership and lifetime.

Depends on D26739

The Unix backend for shared memory needs to keep the mapped size to pass
to munmap, while the Windows backend doesn't. Currently it's reusing the
max_size field, and then zeroing it when it's unmapped, which breaks the
freezing use case. This patch uses a dedicated field for that.

Depends on D26740

This uses RAII to handle error-case cleanup in the POSIX backend for
SharedMemory::Create, to simplify the complexity that will be added to
support freezing.

Depends on D26741

This allows writing to shared memory and then making it read-only before
sharing it to other processes, such that a malicious sandboxed process
cannot regain write access. This is currently available only in the
low-level base::SharedMemory interface.

The freeze operation exposes the common subset of read-only shared
memory that we can implement on all supported OSes: with some APIs
(POSIX shm_open) we can't revoke writeability from existing capabilies,
while for others (Android ashmem) we must revoke it. Thus, we require
that the writeable capability not have been duplicated or shared to
another process, and consume it as part of freezing. Also, because in
some backends need special handling at creation time, freezeability must
be explicitly requested.

In particular, this doesn't allow giving an untrusted process read-only
access to memory that the original process can write.

Note that on MacOS before 10.12 this will use temporary files in order to
avoid an OS security bug that allows regaining write access; those OS
versions are no longer supported by Apple (but are supported by Firefox).

Depends on D26742

One problem with using shared memory instead of files for MemMapSnapshot
is that AutoMemMap was trying to use fstat() to obtain the object size;
that doesn't work with ashmem on Android and was causing problems with
the Mac sandbox, but it's not necessary, because we already know the
size. This patch changes it to not do that.

Depends on D26743

There was a strange bug when converting SharedStringMap to use shared
memory: on Android, some tests would fail because a pref wasn't set or
there was something wrong with an expected error message. The root
cause was that mapping ashmem with MAP_PRIVATE results in all zeroes
(see bug 1525803), but that was read as a valid SharedStringMap with
length 0.

To prevent any possible future issues like that, this patch adds a
nonzero magic number to the header. It fits into padding on 64-bit and
the cost of setting and checking it should be essentially free.

Depends on D26744

Also refactor SharedMemoryBasic::SystemProtect to allow testing cases
that are expected to fail.

Depends on D26748

Fission Milestone: --- → M2
Fission Milestone: M2 → M3
Blocks: 1550037
No longer blocks: 1550037

Jed, can you add a quick summary of where we're at on this?

It also seems like this should still block 1550037, from what I understand the shared font impl doesn't need read/write yet (I could be wrong about that), Jonathan can you confirm?

Flags: needinfo?(jld)
Flags: needinfo?(jfkthame)

The shared font-list implementation wants ongoing read/write access from the parent process, while child (content) processes have read-only access. AIUI that's what bug 1550900 intends to offer, so bug 1550037 (which should probably be re-titled a bit) will build on that.

To make the font-list work with totally-frozen blocks (i.e. the parent loses all write access as soon as it shares the block to any content process) would require its allocation strategy to be reworked to use many smaller shmem blocks, instead of allocating larger "arenas" within which it can incrementally sub-allocate extremely cheaply. I'm hoping bug 1550900 means we can avoid going that way.

Flags: needinfo?(jfkthame)

(In reply to Eric Rahm [:erahm] from comment #28)

Jed, can you add a quick summary of where we're at on this?

It turns out there's a regression caused by bug 1540135 encoding the old Chromium code's path names, as seen on Try here and in the code here, so I need to fix that.

IMHO we should ignore anything in /dev/shm for that — it's not really I/O any more than mmap/munmap are. But it will also need a special case for MacOS ≤ 10.11 (where we actually will do disk I/O, as a workaround for the above-mentioned security bug; and in theory that would have also failed in my Try run, because the filenames changed, except for some reason the fancy new task-based try chooser didn't run any Mac opt tests and this test is disabled on debug builds pending bug 1549723).

Flags: needinfo?(jld)

The tests for unexpected main thread I/O had exemptions for the specific
paths that were being used for shared memory, which would cause it to
fail with the changes in this bug. This patch does two things:

  1. On Linux, /dev/shm is always tmpfs (a memory filesystem), so it's not
    going to cause disk I/O, and it's used by glibc to implement the POSIX
    standard shm_open API. This allows all /dev/shm paths instead of
    limiting it to a hard-coded prefix.

  2. On MacOS, with the patches in this bug, we'll no longer use temporary
    files for shared memory on current OS versions, but we still need them on
    older versions to avoid an OS bug (https://crbug.com/project-zero/1671),
    and they are backed by disk in this case, so we want to allow only the
    IPC files. However, the path prefix has changed.

Fission Milestone: M3 → M4
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ae525f20bdbb Remove SHM_ANON support. r=froydnj https://hg.mozilla.org/integration/autoland/rev/71253b62a633 Get rid of base::SharedMemory::handle. r=froydnj https://hg.mozilla.org/integration/autoland/rev/c47d47d4073d Fix max_size in shared_memory_posix. r=froydnj https://hg.mozilla.org/integration/autoland/rev/7b78b1945532 Clean up shared_memory_posix error handling. r=froydnj https://hg.mozilla.org/integration/autoland/rev/c418095bb7b7 Fix the main thread I/O tests to handle the IPC shared memory changes. r=florian https://hg.mozilla.org/integration/autoland/rev/d0d383e5bae1 Add freezing of IPC shared memory. r=froydnj,kmag https://hg.mozilla.org/integration/autoland/rev/26a8ebcb2db4 Make AutoMemMap not fstat() the mapped object if it doesn't need to. r=kmag https://hg.mozilla.org/integration/autoland/rev/fdba3c5f8fd0 Give SharedStringMap a magic number so that all zeroes isn't a valid instance. r=kmag https://hg.mozilla.org/integration/autoland/rev/41080844e645 Convert MemMapSnapshot to use frozen shared memory. r=kmag https://hg.mozilla.org/integration/autoland/rev/5645e0cc3915 Add unit tests for shared memory freezing. r=froydnj
Backout by aciure@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cc3161d561a1 Backed out 15 changesets (bug 1479960, bug 1426526, bug 1534780, bug 1536697) for causing build bustages CLOSED TREE
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6c2b995a9d30 Remove SHM_ANON support. r=froydnj https://hg.mozilla.org/integration/autoland/rev/39e18373e3d3 Get rid of base::SharedMemory::handle. r=froydnj https://hg.mozilla.org/integration/autoland/rev/036809330a51 Fix max_size in shared_memory_posix. r=froydnj https://hg.mozilla.org/integration/autoland/rev/0aa8af4965c5 Clean up shared_memory_posix error handling. r=froydnj https://hg.mozilla.org/integration/autoland/rev/9d7f1835f96f Fix the main thread I/O tests to handle the IPC shared memory changes. r=florian https://hg.mozilla.org/integration/autoland/rev/8332565a6943 Add freezing of IPC shared memory. r=froydnj,kmag https://hg.mozilla.org/integration/autoland/rev/47d387b6cd4a Make AutoMemMap not fstat() the mapped object if it doesn't need to. r=kmag https://hg.mozilla.org/integration/autoland/rev/8a322064cf6d Give SharedStringMap a magic number so that all zeroes isn't a valid instance. r=kmag https://hg.mozilla.org/integration/autoland/rev/2c4d12bdfec3 Convert MemMapSnapshot to use frozen shared memory. r=kmag https://hg.mozilla.org/integration/autoland/rev/a8518ea4b594 Add unit tests for shared memory freezing. r=froydnj
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla69 → ---
Flags: needinfo?(jld)

Bug 1562358 will take care of the build breakage (confirmed on Try), but I'm going to wait until after the branch next Monday to try to re-land.

Flags: needinfo?(jld)

Hey Jed, can we try again now that the soft freeze is over?

Flags: needinfo?(jld)
Flags: needinfo?(jld)

There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:jld, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(jld)
Flags: needinfo?(jld)
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/16d99c5a8a55 Remove SHM_ANON support. r=froydnj https://hg.mozilla.org/integration/autoland/rev/f814f58d43d6 Get rid of base::SharedMemory::handle. r=froydnj https://hg.mozilla.org/integration/autoland/rev/bdf7a041928a Fix max_size in shared_memory_posix. r=froydnj https://hg.mozilla.org/integration/autoland/rev/6da235c5a77c Clean up shared_memory_posix error handling. r=froydnj https://hg.mozilla.org/integration/autoland/rev/ddfa5ff81061 Fix the main thread I/O tests to handle the IPC shared memory changes. r=florian https://hg.mozilla.org/integration/autoland/rev/225411558a4e Add freezing of IPC shared memory. r=froydnj,kmag https://hg.mozilla.org/integration/autoland/rev/0f466f2a18c0 Make AutoMemMap not fstat() the mapped object if it doesn't need to. r=kmag https://hg.mozilla.org/integration/autoland/rev/1fa598bf2699 Give SharedStringMap a magic number so that all zeroes isn't a valid instance. r=kmag https://hg.mozilla.org/integration/autoland/rev/c3ca7014893c Convert MemMapSnapshot to use frozen shared memory. r=kmag https://hg.mozilla.org/integration/autoland/rev/d8ac382b5f17 Add unit tests for shared memory freezing. r=froydnj
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: