Closed Bug 1455828 Opened 7 years ago Closed 7 years ago

Crash in PrefValue::Deserialize

Categories

(Core :: Preferences: Backend, defect)

Unspecified
Linux
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla61
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox59 --- unaffected
firefox60 --- unaffected
firefox61 + verified

People

(Reporter: calixte, Assigned: jld)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

This bug was filed from the Socorro interface and is report bp-65d48a55-59e4-4d59-bd7e-ceff90180420. ============================================================= Top 10 frames of crashing thread: 0 libxul.so PrefValue::Deserialize modules/libpref/Preferences.cpp:305 1 libxul.so Pref::Deserialize modules/libpref/Preferences.cpp:912 2 libxul.so mozilla::Preferences::DeserializePreferences modules/libpref/Preferences.cpp:3421 3 libxul.so mozilla::dom::ContentProcess::Init dom/ipc/ContentProcess.cpp:215 4 libxul.so XRE_InitChildProcess toolkit/xre/nsEmbedFunctions.cpp:695 5 firefox content_process_main ipc/contentproc/plugin-container.cpp:50 6 firefox main.cold.3 7 libc-2.23.so libc-2.23.so@0x2082f 8 firefox firefox@0x15d9f 9 firefox double_conversion::BignumDtoa ============================================================= There are 33 crashes (from 30 installations) in nightly 61 starting with buildid 20180420100056. :njn, could you investigate please ? By the way, for all the crashes, the uptime is 0 but the startup_crash flag is unset. Is the deserialization occuring during the startup or not ?
Flags: needinfo?(n.nethercote)
The crashes are all on Linux, and are frequent enough that this is the #1 topcrash on Linux Nightly. As per comment 0, they started in build 20180420100056, which is when bug 1439057 and bug 1447867 landed. So I think it's highly likely that one or both of those is to blame. jld, what do you think? The crash that's occurring is caused by the serialized pref data in the shared memory having an unexpected form. It's hard to know exactly what that invalid form is. Perhaps it's full of zeroes or something? Given the high crash rate, I think those two bugs' patches should be backed out until we work out what's happening. jld, can you do that?
Flags: needinfo?(n.nethercote) → needinfo?(jld)
I don't really understand what's going on here. The code I changed was only about obtaining the fd; if that works, the rest of it shouldn't be affected. (And on Linux it should be a regular file in the same directory as before.) One observable change is that I stopped seeking to the end of the file, because mmap doesn't use the file offset, but I didn't find anything trying to read/write the fd directly so that shouldn't matter. The other change is the leaf filename, but if something like AppArmor were filtering based on names that should have made the file creation fail, and then we'd never create the child process.
For information, there are no more crashes with this signature since the patches from bug 1439057 and bug 1447867 have been backed out.
Some other differences: the old code is setting O_APPEND via fdopen, as well as seeking to the end. If there's a stray write to the fd somehow, previously it would have written after the pref blob's null terminator and been ignored (because the length is passed on the command line instead of fstat'ing the file), but now it would corrupt the start. And the old code wasn't setting close-on-exec, while the new code does (via shm_open). This is potentially relevant because — and this is a separate bug I need to file — the IPC fd-shuffling code doesn't appear to unset the close-on-exec flag if an fd winds up mapped to the same fd. From looking at the code: the byte that isn't a valid pref type also isn't NUL, which means we're not reading past the end of the file (or an un-written part of it), which rules out a few things. And we could crash earlier, in the pref type switch, instead of NS_ERRORing: if we get to that case we're going to crash later anyway, but with less context. It would also be possible to include the bad data in the minidump.
Assignee: nobody → jld
Flags: needinfo?(jld)
> And we could crash earlier, in the pref type switch, instead of NS_ERRORing: > if we get to that case we're going to crash later anyway, but with less > context. It would also be possible to include the bad data in the minidump. Yes, the current handling of unexpected data is totally unprincipled. We could crash more eagerly, in more places, and use UNSAFE_PRINTF to show some of the data.
This is currently the top browser crash on Nightly.
Keywords: topcrash
This was fixed by the backouts, as per comment 2. I think we can close it. Bug 1456902 is open for some follow-up work.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
Status: RESOLVED → VERIFIED
Resolution: WORKSFORME → FIXED
Target Milestone: --- → mozilla61

Just by clicking on the signature, we still have numbers of crashes. Jld, is it still caused by the same reason, do you mind take a look? Thanks!

Flags: needinfo?(jld)

(In reply to Sean Feng [:sefeng] from comment #8)

Just by clicking on the signature, we still have numbers of crashes. Jld, is it still caused by the same reason, do you mind take a look? Thanks!

I filed bug 1767026 for this. It looks like a severe enough spike on Nightly that I think it is worth a new bug.

It's not clear if I ever really knew what's going on in this bug, but the comments in bug 1767026 so far suggest that the recent crashes might be a recent regression.

Flags: needinfo?(jld)
You need to log in before you can comment on or make changes to this bug.