Closed
Bug 1409444
Opened 7 years ago
Closed 7 years ago
stylo: Crash in rand::weak_rng
Categories
(Core :: CSS Parsing and Computation, defect, P3)
Tracking
()
VERIFIED
FIXED
mozilla58
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox56 | --- | unaffected |
firefox57 | + | verified |
firefox58 | --- | verified |
People
(Reporter: philipp, Assigned: manishearth)
References
(Blocks 1 open bug)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(2 files, 1 obsolete file)
(deleted),
text/x-review-board-request
|
manishearth
:
review+
ritu
:
approval-mozilla-beta+
|
Details |
(deleted),
patch
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-ed6d1e84-c4e2-402e-b778-05a390171017.
=============================================================
Crashing Thread (29)
Frame Module Signature Source
0 xul.dll std::panicking::rust_panic_with_hook src/libstd/panicking.rs:555
1 xul.dll std::panicking::begin_panic<collections::string::String> src/libstd/panicking.rs:511
2 xul.dll std::panicking::begin_panic_fmt src/libstd/panicking.rs:495
3 xul.dll rand::weak_rng third_party/rust/rand/src/lib.rs:856
4 xul.dll std::sys_common::backtrace::__rust_begin_short_backtrace<closure, ()> src/libstd/sys_common/backtrace.rs:133
5 xul.dll alloc::boxed::{{impl}}::call_box<(), closure> src/liballoc/boxed.rs:647
6 xul.dll std::sys::imp::thread::{{impl}}::new::thread_start src/libstd/sys/windows/thread.rs:50
7 kernel32.dll BaseThreadInitThunk
8 ntdll.dll RtlUserThreadStart
this rust panic signature in content processes is regressing in larger numbers since 57.0b6 and 58.0a1 build 20171006220306.
bug fixes that went into b6 were: https://mzl.la/2fPo1by
the crashes seem to hit particular installations repeatedly and come with various moz crash reasons:
https://crash-stats.mozilla.com/search/?signature=%3Drand%3A%3Aweak_rng&date=%3E%3D2017-10-01T00%3A00%3A00.000Z&_facets=moz_crash_reason#facet-moz_crash_reason
Comment 1•7 years ago
|
||
Manish, given your experience debugging the hashmap RNG failures, any thoughts on these weak_rng() panics after OsRng::new() fails? The errors affect all Windows versions (plus a few OS X 10.12.x users), but 94% of the crash reports are from Windows 7.
Perhaps the Rust rand library should work around these RNG failures in the Windows implementation of OsRng::new(). That would fix the whole Rust ecosystem and we wouldn't need to play Whac-A-Mole with RNG panics in Firefox. :)
Blocks: stylo-crash-reports
Crash Signature: [@ rand::weak_rng] → [@ mozalloc_abort | abort | rand::weak_rng]
[@ rand::weak_rng]
Flags: needinfo?(manishearth)
Priority: -- → P3
Assignee | ||
Comment 2•7 years ago
|
||
This is the same problem, OsRng sometimes gets funky failures from Windows, as far as I can tell due to broken DLLs.
We fixed the last one by removing uses of rng from hashmaps. This can't be fixed the same way; this is part of the rust runtime (and the fix would need to be in the stdlib's copy of the rand crate)
Flags: needinfo?(manishearth)
Comment 3•7 years ago
|
||
(In reply to Manish Goregaokar [:manishearth] from comment #2)
> We fixed the last one by removing uses of rng from hashmaps. This can't be
> fixed the same way; this is part of the rust runtime (and the fix would need
> to be in the stdlib's copy of the rand crate)
Yes, I'm suggesting we fix this upstream in Rust's stdlib.
Assignee | ||
Comment 4•7 years ago
|
||
I don't see how this can be fixed, though. The OS's RNG returned a weird error code; what *can* we do?
Assignee | ||
Comment 5•7 years ago
|
||
Also, this isn't actually a runtime bug, I realized that the backtrace indicates that we had a weak_rng failure in a thread we spawned.
Looking at callers of weak_rng apparently Rayon threads initialize an RNG
Comment 6•7 years ago
|
||
weak_rng could silently fall back to a weaker seed value like the current time if OsRng fails. weak_rng doesn't make any promises about being cryptographically secure.
Comment 7•7 years ago
|
||
Alex, as an owner of the "rand" crate, what do you think about changing weak_rng to silently fall back to a weaker seed value (like the current time + whatever) instead of panicking when OsRng::new() returns an error?
Crashing Firefox because weak_rng has a weak seed seems pretty extreme. These crashes currently only affect the Firefox content process, but they will affect the main browser process once we start using Stylo for XUL/chrome documents.
This panic is currently Windows content process top crash #10 in Beta 57. It's only #63 in Nightly 58, probably because fewer Nightly users have broken Windows 7 systems.
Besides Windows' CryptAcquireContext returning weird errors, we've hit OS RNG failures in Firefox for Android trying to open("/dev/urandom") because the OS ran out of fds in automation (bug 1140806).
Flags: needinfo?(acrichton)
Assignee | ||
Comment 8•7 years ago
|
||
I also filed https://github.com/rust-lang-nursery/rand/issues/180
Comment 9•7 years ago
|
||
Looking over this again. It seems like in any case you're considering upgrading and/or updating the `rand` crate, right? If we were to implement a fix, that is, you'd presumably update. Looking at the crash report you're currently using 0.3.15 of the `rand` crate, and in the meantime we've since published up to 0.3.17 for the `rand` crate.
The updates notably include https://github.com/rust-lang-nursery/rand/issues/111 which switches the implementation of OS randomness on Windows to a private API in Windows itself. On that issue it was also indicated that it's the same strategy Gecko takes I believe?
Basically what I'm thinking is that if you're thinking of updating anyway to fix this, maybe it's worth updating to 0.3.17 to pull in that fix and see what happens? That may end up transitively fixing https://github.com/rust-lang/rust/issues/44911 which would I think solve this issue, right?
Failing that it seems fine to me to fall back to some sort of a weaker seed for a "weak rng", it is, after all, weak!
Flags: needinfo?(acrichton)
Comment 10•7 years ago
|
||
Manish, can you try updating Nightly 58 from `rand` crate version 0.3.15 to 0.3.17 to see if https://github.com/rust-lang-nursery/rand/issues/111 avoids this crash?
This crash signature is content process top crash #8 in Beta 57, which is pretty high. It would be good to address this somehow, though I doubt we would want to uplift the `rand` 0.3.17 crate this late in Beta 57.
For some reason, this signature jumped in Beta 57 57.0b6 (2017-10-03) and continues to get worse:
57.0b3 = 3 reports
57.0b4 = 1
57.0b5 = 1
57.0b6 = 117
57.0b7 = 188
57.0b8 = 296
57.0b9 = 369
The pushlog between 57.0b5 and 57.0b6 includes your FnvHashMap fix (bug 1385971). Maybe we just swapped the hash::map::{{impl}}::new::KEYS::__init crashes for weak_rng for these sad Windows 7 users?
https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_57_0b5_RELEASE&tochange=FIREFOX_57_0b6_RELEASE
Flags: needinfo?(manishearth)
Comment 11•7 years ago
|
||
[Tracking Requested - why for this release]:
This Stylo crash signature is content process top crash #8 in Beta 57.
This crash spiked in 57.0b6 after bug 1385971 fixed a similar crash (thus allowing those crashing users to hit this new crash instead).
tracking-firefox57:
--- → ?
Comment hidden (mozreview-request) |
Assignee | ||
Comment 13•7 years ago
|
||
Flags: needinfo?(manishearth)
Comment 14•7 years ago
|
||
Lots of .cargo-ok changes... Try using cargo-vendor 0.1.11? Or we should make some progress on bug 1403407...
Comment 15•7 years ago
|
||
Wait, aren't this crashes inside rust stdlib support? How does updating the crate help?
Comment 16•7 years ago
|
||
(In reply to Emilio Cobos Álvarez [:emilio] from comment #15)
> Wait, aren't this crashes inside rust stdlib support? How does updating the
> crate help?
err, s/support/code/, but my question still stands, this updates the crate in the rust libs, but not in the stdlib, right? (Though it may be the case that I'm missing something).
Assignee | ||
Comment 17•7 years ago
|
||
They're not. I thought they were.
They're in the Rayon thread initialization code. I looked at the stacks initially and thought they were in the Rust runtime but on looking closer the rust runtime does not rely on weak_rng.
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → manishearth
Comment 18•7 years ago
|
||
mozreview-review |
Comment on attachment 8920726 [details]
Bug 1409444 - stylo: Update rand to 0.3.17;
https://reviewboard.mozilla.org/r/191740/#review197030
This seems to be updating lots of unrelated things... Please just update rand (and maybe its dependencies).
Attachment #8920726 -
Flags: review?(xidorn+moz)
Assignee | ||
Comment 19•7 years ago
|
||
That is what I did. cargo update -p does the minimum possible update.
Comment hidden (mozreview-request) |
Comment 21•7 years ago
|
||
This should be the minimum possible update.
Assignee | ||
Comment 22•7 years ago
|
||
mozreview-review |
Comment on attachment 8920947 [details]
Bug 1409444 - Update rand to 0.3.17.
https://reviewboard.mozilla.org/r/191906/#review197082
Attachment #8920947 -
Flags: review?(manishearth) → review+
Comment 23•7 years ago
|
||
Pushed by xquan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2dce277776e9
Update rand to 0.3.17. r=manishearth
Comment 24•7 years ago
|
||
Tracking 57+ based on Comment 11.
Updated•7 years ago
|
Comment 25•7 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Comment 26•7 years ago
|
||
Do you think we should uplift it to 57? If so, could you create the uplift request? I'm not sure about all the details with this bug.
Flags: needinfo?(manishearth)
Assignee | ||
Updated•7 years ago
|
Attachment #8920726 -
Attachment is obsolete: true
Flags: needinfo?(manishearth)
Assignee | ||
Comment 27•7 years ago
|
||
Comment on attachment 8920947 [details]
Bug 1409444 - Update rand to 0.3.17.
Approval Request Comment
[Feature/Bug causing the regression]: Stylo
[User impact if declined]: Crashes on Windows systems with broken crypto DLLs
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: We haven't seen any crashes yet
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: No
[Why is the change risky/not risky?]: Updates some Rust libraries.
[String changes made/needed]: No
Attachment #8920947 -
Flags: approval-mozilla-beta?
Comment on attachment 8920947 [details]
Bug 1409444 - Update rand to 0.3.17.
This crash has ~500 instances in a week, let's take the fix, beta57+
Attachment #8920947 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 30•7 years ago
|
||
Hold on. We may also need to address bug 1411250 first.
Comment 31•7 years ago
|
||
(In reply to Xidorn Quan [:xidorn] UTC+10 (less responsive Nov 5 ~ Dec 16) from comment #30)
> Hold on. We may also need to address bug 1411250 first.
This uplift will hopefully fix content process top crash #5 that Windows Beta 57 users are hitting. I don't think this uplift needs to wait for bug 1411250 to fix a Linux test failure. We can uplift bug 1411250 when its fix is available.
Comment 32•7 years ago
|
||
This fix has already missed a beta now because of the conflicts (and there's only two left this cycle). Can we please move forward with a rebased patch?
Flags: needinfo?(xidorn+moz)
Comment 33•7 years ago
|
||
Flags: needinfo?(xidorn+moz)
Flags: needinfo?(manishearth)
Comment 34•7 years ago
|
||
I guess I can just land it directly.
Comment 35•7 years ago
|
||
Updated•7 years ago
|
Target Milestone: mozilla58 → mozilla57
Comment 37•7 years ago
|
||
I think this speculative fix is working because we haven't seen any weak_rng crash reports in Nightly 58.0a1 since we landed this fix on October 23 (build id 2017, though the crash volume had been very low on Nightly:
https://crash-stats.mozilla.com/search/?signature=~weak_rng&product=Firefox&version=58.0a1&platform=Windows&date=%3E%3D2017-04-30T15%3A51%3A19.000Z&_sort=-date&_facets=signature&_facets=version&_facets=platform_pretty_version&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-build_id
This fix will ship in Beta 57.0b13 today, so we'll see if the Beta crash reports stop after 57.0b12:
https://crash-stats.mozilla.com/search/?signature=~weak_rng&product=Firefox&version=57.0b&platform=Windows&date=%3E%3D2017-04-30T08%3A51%3A19.000Z&_sort=-date&_facets=signature&_facets=version&_facets=platform_pretty_version&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version
Comment 38•7 years ago
|
||
There have been no weak_rng crash reports in 56.0b13, so we can assume we've fixed the crash!
https://crash-stats.mozilla.com/search/?signature=~weak_rng&product=Firefox&version=57.0b&platform=Windows&date=%3E%3D2017-10-23T10%3A27%3A29.000Z&date=%3C2017-11-06T09%3A27%3A29.000Z&_sort=-date&_facets=signature&_facets=version&_facets=platform_pretty_version&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version
Updated•5 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•