Open Bug 1531887 Opened 6 years ago Updated 2 years ago

Fenix segfaults in arena_dalloc when Rust code calls std::fs::canonicalize

Categories

(Core :: Memory Allocator, defect, P1)

Unspecified
Android
defect

Tracking

()

People

(Reporter: myk, Assigned: glandium)

References

Details

Firefox for Android (Fennec) segfaults in arena_dalloc when Rust code calls std::fs::canonicalize. The top of the stack looks like this:

arena_dalloc(void*, unsigned long, arena_t*) mozjemalloc.cpp:3307
Allocator<MozJemallocBase>::free(void*) malloc_decls.h:41
std::sys::unix::fs::canonicalize::h539ef1ed7a3692e0 0x000073c9ce8471e3
std::fs::canonicalize::hfcd6e3e807762bc3 0x000073c9ce87be36
(Rust code that calls std::fs::canonicalize)

There are some calls to std::fs::canonicalize in-tree, although I don't know if any are currently being invoked on Android. I noticed this with the patch in bug 1429796 (where I've provided a workaround that avoids the call).

I haven't been able to reproduce in a Rust crate outside Gecko, even when specifying jemalloc as the global allocator via the jemallocator crate. But it's easy to reproduce in Gecko by adding some code that calls std::fs::canonicalize to a Rust function that gets called at startup, like GkRust_Init:

    let dir = std::env::temp_dir();
    let path = std::fs::canonicalize(dir.as_path()).expect("path");
    dbg!(path);

Here's a patch that does that:

https://github.com/mykmelez/gecko/compare/central...test-fs-canonicalize

This may be a Rust stdlib bug, but since I've only been able to reproduce it in Gecko with mozjemalloc, I'm prospectively filing it here.

Myk, can you provide the logcat output, a core dump, or a crash report? I'm guessing it's this line 3307, which implies some sort of memory overflow or a null arena. Reproducing with a debug build would be helpful as we might trip some of the earlier assertions.

Flags: needinfo?(myk)

... or freeing a pointer that was not allocated, or freeing a pointer that was already freed (in the rare case where that was the last allocation and the chunk containing it was reclaimed).

For some reason, my debug build isn't crashing on the assertion, but this tryserver run https://treeherder.mozilla.org/#/jobs?repo=try&revision=a84b8d286279524a9693c259087a220891f77341 of mgoodwin's patch in bug 1429796 (which calls std::fs::canonicalize via a function in the rkv crate) does crash on that assert when running xpcshell tests. Here's an example with stack traces and minidump:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=230752223&repo=try&lineNumber=2030

Ouch, it's a difficult problem caused by rust's libstd. Here's the code:

pub fn canonicalize(p: &Path) -> io::Result<PathBuf> {
    let path = CString::new(p.as_os_str().as_bytes())?;
    let buf;
    unsafe {
        let r = libc::realpath(path.as_ptr(), ptr::null_mut());
        if r.is_null() {
            return Err(io::Error::last_os_error())
        }
        buf = CStr::from_ptr(r).to_bytes().to_vec();
        libc::free(r as *mut _);
    }
    Ok(PathBuf::from(OsString::from_vec(buf)))
}

The problem is that libc::free is free(), which we redirect to jemalloc. And libc::realpath allocates memory with the Android system allocator. Kaboom.

A workaround would be to duplicate this function and change it to pass a PATH_MAX-sized buffer as second argument to libc::realpath, which would prevent it from allocating. Then libc::free would not be necessary and wouldn't fail. I'll open a rust issue to see what they think about it for std itself.

(In reply to Mike Hommey [:glandium] from comment #4)

A workaround would be to duplicate this function and change it to pass a PATH_MAX-sized buffer as second argument to libc::realpath, which would prevent it from allocating. Then libc::free would not be necessary and wouldn't fail. I'll open a rust issue to see what they think about it for std itself.

It looks like they changed from using a buffer in the past for esoteric "what if this is gnu hurd" reasons. It seems like the alloc/free mismatch is more important to handle, so probably not a hard sell.

Blocks: 1429796

(Clearing the needinfo flag that I meant to clear in conjunction with comment #3.)

Flags: needinfo?(myk)

(In reply to Myk Melez [:myk] [@mykmelez] from comment #0)

There are some calls to std::fs::canonicalize in-tree, although I don't know if any are currently being invoked on Android. I noticed this with the patch in bug 1429796 (where I've provided a workaround that avoids the call).

For those following along at home, what's the workaround?

Flags: needinfo?(myk)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #8)

For those following along at home, what's the workaround?

Avoiding the use of rkv::Manager means there are calls to canonicalize. The good news is that this fixes the android crasher. The bad news is it seems we have other (similar looking) crashes in rkv on other platforms; e.g. in rkv::env::Rkv::from_env (in other words, this behaviour may be exhibited in other parts of the code that don't use canonicalize - and not on Android.

(In reply to Mark Goodwin [:mgoodwin] from comment #9)

this behaviour may be exhibited in other parts of the code that don't use canonicalize - and not on Android.

Ah, I spoke to soon; the other issue I'm seeing is probably unrelated (see bug 1533049 comment 2)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #8)

(In reply to Myk Melez [:myk] [@mykmelez] from comment #0)
For those following along at home, what's the workaround?

Dana: I'm sorry for the delay responding. Somehow I missed your question and needinfo request.

The Manager is an abstraction over the Rkv::new() function (and the other Rkv::* functions that return an Rkv instance). To avoid the manager, you just call those functions directly.

Note that there must be only one Rkv instance at a time for a given datastore path, and you need to enforce that constraint if you create Rkv instances yourself.

(This is the primary functionality that Manager provides, as it guarantees that it will return an existing Rkv instance for a path as long as that Rkv instance was previously created via the manager.)

(In reply to Mark Goodwin [:mgoodwin] from comment #9)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #8)
Avoiding the use of rkv::Manager means there are calls to canonicalize.

Manager canonicalizes paths because it is a general API that doesn't know the origin of the path being provided. However, a profile dir path obtained from the directory service is presumably already canonical, and thus it probably isn't necessary to canonicalize it when calling Rkv::new directly.

Flags: needinfo?(myk)
Blocks: 1515094
No longer blocks: 1515094

@glandium, you mentioned over in https://github.com/rust-lang/rust/issues/58862 that we could use __malloc_hook and friends on Android 9+ to solve the problem. Since Fennec and GeckoView require Android 16+ at this point, can we do that for Gecko, even if it doesn't solve the more general problem for Rust (because it needs to support older Android versions)?

Flags: needinfo?(mh+mozilla)

__malloc_hook wouldn't work for us, because our libraries are loaded too late to make any difference.

Also, when I said Android 9, I meant Android version 9, not Android API 9. That's Android P.

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #13)

__malloc_hook wouldn't work for us, because our libraries are loaded too late to make any difference.

Also, when I said Android 9, I meant Android version 9, not Android API 9. That's Android P.

So is there a solution on our side, or should we try harder to get something changed in the rust std library?

Flags: needinfo?(mh+mozilla)

I think I have a general solution, but I need to test it.

Flags: needinfo?(mh+mozilla)
Assignee: nobody → mh+mozilla

And fixing bug 1530874 gave me a slightly different idea.

It seems we have nothing currently, in rust code, that invokes free expecting libc's. Myk, can you point me to what landed that removed the call to std::fs::canonicalize, so that I can try reverting it and see if I see the libc free call happening correctly with my WIP changes?

Flags: needinfo?(myk)

I landed a patch to avoid calling std::fs::canonicalize() from toolkit/components/kvstore/ over in bug 1542888. If you revert that patch, then kvstore should call that function when dom/notification/NotificationDB.jsm opens its database.

However, it might be simpler to just add these three lines to GkRust_Init() in toolkit/library/rust/shared/lib.rs:

    let dir = std::env::temp_dir();
    let path = std::fs::canonicalize(dir.as_path()).expect("path");
    dbg!(path);

That function gets called on startup, so this code snippet should unconditionally crash Firefox on Android if the bug is still present.

Flags: needinfo?(myk)
Depends on: 1546587
Priority: -- → P1

Mike, do you still see this one as a P1 bug as fennec is EOL ?

Flags: needinfo?(mh+mozilla)

This is still happening if the workaround is removed.

To be clear, this is happening on Android in general, not Fennec in particular, so I'd still say this is a P1.

Summary: Fennec segfaults in arena_dalloc when Rust code calls std::fs::canonicalize → Fenix segfaults in arena_dalloc when Rust code calls std::fs::canonicalize

Mike will have a look to this if it happens again (ie if a new crate makes a similar mistake)

Flags: needinfo?(mh+mozilla)
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.