Crash in [@ Hacl_Chacha20Poly1305_128_aead_encrypt] & [@ Hacl_Chacha20Poly1305_128_aead_decrypt]
Categories
(NSS :: Libraries, defect, P1)
Tracking
(Root Cause: Testing Error: Other, firefox-esr68 unaffected, firefox71 unaffected, firefox72 unaffected, firefox73 fixed, firefox74 blocking fixed)
Version | Tracking | Status |
---|---|---|
firefox-esr68 | --- | unaffected |
firefox71 | --- | unaffected |
firefox72 | --- | unaffected |
firefox73 | --- | fixed |
firefox74 | blocking | fixed |
People
(Reporter: marcia, Assigned: franziskus)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(3 files)
This bug is for crash report bp-6be0570e-6b86-4d85-928e-a696e0191220.
Seen while looking at crash stats; the crashes started in build 20191220095035:
Possible regression window based on build id: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8e1b11b00157269f1f781753efc241e005efeaf1&tochange=5a7641c75a2e3d54864c23242f2b986c7ec77796
Bug 1602020? ni on J.C.
Top 10 frames of crashing thread:
0 freebl3.dll Hacl_Chacha20Poly1305_128_aead_encrypt security/nss/lib/freebl/verified/Hacl_Chacha20Poly1305_128.c:1212
1 freebl3.dll ChaCha20Poly1305_Seal security/nss/lib/freebl/chacha20poly1305.c:170
2 softokn3.dll sftk_ChaCha20Poly1305_Encrypt security/nss/lib/softoken/pkcs11c.c:710
3 softokn3.dll NSC_Encrypt security/nss/lib/softoken/pkcs11c.c:1509
4 nss3.dll PK11_Encrypt security/nss/lib/pk11wrap/pk11obj.c:979
5 nss3.dll ssl3_ChaCha20Poly1305 security/nss/lib/ssl/ssl3con.c:1845
6 nss3.dll ssl3_SendRecord security/nss/lib/ssl/ssl3con.c:2586
7 nss3.dll ssl3_FlushHandshake security/nss/lib/ssl/ssl3con.c:2797
8 nss3.dll ssl3_SendFinished security/nss/lib/ssl/ssl3con.c:11555
9 nss3.dll ssl3_SendClientSecondRound security/nss/lib/ssl/ssl3con.c:7977
Comment 1•5 years ago
I can reproduce the crash by clicking the Wikipedia tile in Top Sites on the New Tab page.
Comment 2•5 years ago
Regression window:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=bba57a78314a41dd3a855def379afd401d5fd7b0&tochange=23220e6aef9d9290cb4d1b40d435145f979e962c
Assignee | ||
Comment 3•5 years ago
This looks like the CPU isn't supporting the instruction (but I don't see a specific CPU that could cause this in the crash reports) or is somehow broken. I'd disable this on Windows for now and add additional hardware checks.
Assignee | ||
Comment 4•5 years ago
Reporter | ||
Comment 6•5 years ago
Adding a macOS signature and changing platform to all.
Comment 7•5 years ago
These crashes are happening on older CPUs, and our initial patch in this bug to just check SSE2/SSE3 is not going to solve the issue. All the crashing systems report having SSE2 and SSE3 -- and SSE 4.1, even.
It looks like crashing computers that are reasonably easy to identify -- at least in the Mac family -- are a 2010 MacPro with a Xeon "Bloomfield" processor, and a 2008 unibody MacBook Pro using a "Penryn" Core2 Duo processor.
I actually have a Penryn-based MacBook Pro in storage, so that is probably where I'll go to debug this more properly. But given the lateness of the hour and the proximity of the holidays, rather than attempt the patch in this bug, I'm going to, with a heavy heart, back out Bug 1574643's patches and re-uplift. We'll have to try again in the next Firefox.
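The distinction drawn in this comment (SSE4.1 present, AVX absent) corresponds to separate feature bits in ECX of CPUID leaf 1. The following is a minimal sketch of reading them, assuming GCC or Clang on x86 and the `<cpuid.h>` helper `__get_cpuid`. The names `cpu_features`, `decode_leaf1_ecx` and `read_cpu_features` are hypothetical and are not NSS's own detection code. The sketch treats AVX as usable only when OSXSAVE is also set, and it deliberately omits the further XGETBV check of XCR0 that a complete implementation would need.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

/* Feature bits reported in ECX by CPUID leaf 1. */
#define ECX_SSE3    (1u << 0)
#define ECX_SSE41   (1u << 19)
#define ECX_OSXSAVE (1u << 27)
#define ECX_AVX     (1u << 28)

/* Hypothetical result type for this sketch. */
struct cpu_features {
    bool sse3;
    bool sse41;
    bool avx;
};

/* Decode a raw ECX value; kept separate so it can be exercised without CPUID. */
static struct cpu_features decode_leaf1_ecx(uint32_t ecx)
{
    struct cpu_features f;
    f.sse3 = (ecx & ECX_SSE3) != 0;
    f.sse41 = (ecx & ECX_SSE41) != 0;
    /* Assumption: require OSXSAVE too; the XCR0 check via XGETBV is omitted. */
    f.avx = (ecx & ECX_AVX) != 0 && (ecx & ECX_OSXSAVE) != 0;
    return f;
}

/* Query the running CPU. Returns false if CPUID leaf 1 is unavailable. */
static bool read_cpu_features(struct cpu_features *out)
{
#ifdef SKETCH_HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    *out = decode_leaf1_ecx(ecx);
    return true;
#else
    (void)out;
    return false;
#endif
}
```

A Penryn-class ECX value would decode with `sse3` and `sse41` set and `avx` clear, which is the combination that a check limited to SSE2/SSE3 cannot distinguish from a newer CPU.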
Comment 8•5 years ago
The backout of the regressing changes is on autoland now:
https://hg.mozilla.org/integration/autoland/rev/0450c6ad5b85
At this point I think the HACL* updates will need to slip to NSS 3.50 / Firefox 74.
Reporter | ||
Comment 9•5 years ago
Thanks J.C. - I did note that this also affected Fenix, even though we are not able to show that in the crash stop table - see https://crash-stats.mozilla.org/report/index/290c74d6-4ad1-4e12-b8fc-ddf2b0191223 for a sample crash.
Comment 10•5 years ago
The priority flag is not set for this bug.
:jcj, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 11•5 years ago
This was fixed by backout in bug 1574643 comment 4.
Comment 12•5 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #11)
> This was fixed by backout in bug 1574643 comment 4.
Unfortunately this startup crash is now back in the latest Nightly version, as built from https://hg.mozilla.org/mozilla-central/rev/5ba39736e74b8a072a63ee215545f89d5c2ec8c8, since the change was apparently re-landed in https://bugzilla.mozilla.org/show_bug.cgi?id=1574643#c5
Could we get that backed out again, and a new Nightly version built, please? :-)
Comment 13•5 years ago
I really need to know explicit CPUs that are hitting this crash; the crash stats seemed to point to early SSE4 (Penryn) microarchitecture, which appears to all be fixed by gating the feature on also having AVX.
I'll prepare a backout once I have time later this morning.
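Gating the accelerated path on AVX in addition to SSE4.1 can be sketched as a small run-time dispatcher. This is an illustration only, assuming GCC or Clang on x86 and their `__builtin_cpu_supports` built-in rather than NSS's own CPU detection. The names `xor_fn`, `xor_portable`, `xor_vec128`, `cpu_can_run_vec128` and `select_xor` are hypothetical, and the XOR routine merely stands in for the real cipher code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature shared by both code paths (not the NSS/HACL* API). */
typedef void (*xor_fn)(uint8_t *out, const uint8_t *in,
                       const uint8_t *key, size_t len);

/* Portable fallback: safe on every CPU. */
static void xor_portable(uint8_t *out, const uint8_t *in,
                         const uint8_t *key, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = in[i] ^ key[i];
    }
}

/* Stand-in for a routine that would live in a file built with
 * -msse4.1 -mavx. Here it simply reuses the portable body. */
static void xor_vec128(uint8_t *out, const uint8_t *in,
                       const uint8_t *key, size_t len)
{
    xor_portable(out, in, key, len);
}

/* True only if the CPU reports both SSE4.1 and AVX. */
static bool cpu_can_run_vec128(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.1") && __builtin_cpu_supports("avx");
#else
    return false; /* unknown platform: stay on the portable path */
#endif
}

/* Every caller must go through this selector; a call site that jumps
 * straight to the accelerated routine bypasses the gate. */
static xor_fn select_xor(void)
{
    return cpu_can_run_vec128() ? xor_vec128 : xor_portable;
}
```

The design point is that the gate only helps if no path reaches the accelerated routine without passing through the selector.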
Comment 14•5 years ago
Also, please point me to a crash link for this?
Comment 15•5 years ago
Comment 16•5 years ago
Backed out Bug 1606927 from central: https://hg.mozilla.org/mozilla-central/rev/46e1717494fbf0f1913e22052c3304fe9f7c19b8 and retriggered desktop nightlies.
Comment 17•5 years ago
(In reply to J.C. Jones [:jcj] (he/him) from comment #13)
> I really need to know explicit CPUs that are hitting this crash;
I'm attaching a (redacted) report from the "CPUID CPU-Z" program, which hopefully contains enough information to be useful. (If you need the full report, I'd be happy to provide that privately.)
Comment 18•5 years ago
I have a Core 2 Duo E8400, which is based on the Penryn microarchitecture. I didn't notice the crash in December, so I checked with mozregression and confirmed that I was affected by it then as well.
Comment 19•5 years ago
Thanks, both. It looks like the fix was incomplete: gating on AVX is the correct approach, but we missed a code path that isn't always hit. It is always hit in your cases, though -- sorry about that! Incomplete testing.
(My Penryn MacBook isn't updated enough to test a whole Firefox build, so the tests were too fine-grained and didn't hit the erroneous flow the way a full Firefox does. I'm upgrading it to El Capitan right now for verification before re-landing yet again.)
Comment 20•5 years ago
I also get this crash, but on what seems to be a different microarchitecture: Kaby Lake (Intel(R) Pentium(R) CPU 4415U).
Comment 21•5 years ago
If anyone with affected hardware is able to test with [1] (Mac) or [2] (Linux/Win), that would be very helpful.
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=511735bd82c92b1d4349f5d7b84c3333e3c77930&selectedJob=285079074
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=d51d4c953f845dcf63cccbbcfc95157867b911c1
Comment 22•5 years ago
I tested the try build on Linux x86-64, connecting to and using wikipedia.org works (also via search keyword), browser test on ssllabs.com shows no problems with ChaCha suites. Anything else that might be helpful?
Comment 23•5 years ago
Thanks, Viktor! The most important thing would be to verify the ciphersuite used on Wikipedia is actually exercising ChaCha20. You can do this by clicking the padlock in the address bar > Right arrow > More Information. "CHACHA20" should be somewhere in the ciphersuite string under "Technical Details". You could also try the parent build, which should crash.
We're still waiting on OS updates to our machine that has one of these CPUs. Once done we'll go through the same steps.
Comment 24•5 years ago
(In reply to Kevin Jacobs [:kjacobs] from comment #23)
> The most important thing would be to verify the ciphersuite used on Wikipedia is actually exercising ChaCha20. You can do this by clicking the padlock in the address bar > Right arrow > More Information. "CHACHA20" should be somewhere in the ciphersuite string under "Technical Details".
I actually did this to verify it. I also wanted to verify this on google.com, but I don't get ChaCha20 by default there, so I disabled TLS 1.3 (I don't know how to force the use of ChaCha20 with TLS 1.3 in Firefox) and set "security.ssl3.ecdhe_ecdsa_aes_128_gcm_sha256" to "false", and then tried to find out the ciphersuite and unfortunately got the crash from bug 1609144. So I had to check via the developer tools that the connection indeed used ChaCha20.
Comment 25•5 years ago
(In reply to Tom Schuster [:evilpie] from comment #20)
> Created attachment 9121122 [details]
> cat /proc/cpuinfo
>
> I also get this crash, but this seems to be a different micro architecture Kaby Lake (Intel(R) Pentium(R) CPU 4415U).
This is fascinating; Kaby Lake has AVX, but that particular CPU does not, and your cpuinfo concurs (of course): https://ark.intel.com/content/www/us/en/ark/products/96508/intel-pentium-gold-processor-4415u-2m-cache-2-30-ghz.html
I've opened Bug 1609569 to track fixing at least the Pentium Gold-series processors that have SSE4 but no AVX.
Comment 26•5 years ago
https://phabricator.services.mozilla.com/D60086 has the new uplift to m-c with the changes from Comment 21.
Comment 27•5 years ago
Possibly related https://bugzilla.mozilla.org/show_bug.cgi?id=1609492
The cpu on this desktop running nightly 74.0a1 is an Intel Pentium CPU G4500 @ 3.50GHz where the /proc/cpuinfo claims :
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust erms invpcid rdseed smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp
Strangely an Intel Core i7-4810MQ CPU @ 2.80GHz seems to run nightly just fine. On the i7 machine I see :
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
I would need to awk/sed/grep-fu these to get the diff :
3dnowprefetch art avx avx2 bmi1 bmi2 clflushopt epb
f16c fma hwp hwp_act_window hwp_epp hwp_notify
ibpb ibrs ida intel_pt rdseed smx stibp
tsc_known_freq xgetbv1 xsavec xsaves
Those above items are unique between the cpu types but the Intel i7 runs nightly fine :
https://hg.mozilla.org/mozilla-central/rev/c35bb210b8ae793c844bd94c1848d246bf601293
The Intel Pentium G4500 crashes repeatedly per 1609492.
--
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
GreyBeard and suspenders optional
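The comparison of the two `flags` lines in this comment can also be done programmatically. The following is a minimal sketch that prints the flags present in one space-separated list and absent from the other; the helper names `has_flag` and `print_missing` are hypothetical. It matches whole tokens only, so `avx` is not confused with `avx2`.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if the n-byte token `flag` appears as a whole token in `list`. */
static bool has_flag(const char *list, const char *flag, size_t n)
{
    const char *p = list;
    while (*p) {
        p += strspn(p, " \t\n");            /* skip separators */
        size_t len = strcspn(p, " \t\n");   /* length of this token */
        if (len == n && strncmp(p, flag, n) == 0) {
            return true;
        }
        p += len;
    }
    return false;
}

/* Print each flag of `a` that is missing from `b`; return how many. */
static int print_missing(const char *label, const char *a, const char *b)
{
    int count = 0;
    const char *p = a;
    printf("%s:", label);
    while (*p) {
        p += strspn(p, " \t\n");
        size_t len = strcspn(p, " \t\n");
        if (len > 0 && !has_flag(b, p, len)) {
            printf(" %.*s", (int)len, p);
            count++;
        }
        p += len;
    }
    printf("\n");
    return count;
}
```

Calling `print_missing` twice, once in each direction, with the two `flags` strings pasted from `/proc/cpuinfo` gives both halves of the difference.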
Comment 28•5 years ago
The priority flag is not set for this bug.
:jcj, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 29•5 years ago
@Dennis Clarke:
Wikipedia [1] says (without source) about Intel CPUs with AVX support: "Not all CPUs from the listed families support AVX. Generally, CPUs with the commercial denomination "Core i3/i5/i7" support them, whereas "Pentium" and "Celeron" CPUs don't."
This is probably the reason for what you describe in bug 1609492, which looks like a duplicate of this bug. I also got this crash when I tried to connect to Wikipedia, and Wikipedia is the only site that I know for which I get ChaCha20 encryption by default (i.e. without changing TLS related settings in about:config).
[1] https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX
Comment 30•5 years ago
The offending code, with further AVX checks, will be in the next Firefox Nightly (https://hg.mozilla.org/mozilla-central/rev/ef97ef459394). And I agree that bug 1609492 is a duplicate.
Thanks for running down the CPU differences there Dennis, that's helpful to confirm our suspicion. While debugging in Nightly is a terrible thing, we are reasonably sure we got everything this time.
Note that I've opened Bug 1609569 to find a path to avoid needing AVX here. It's curious because inspection of the relevant code will show we use no AVX instructions. The optimizer is adding some somewhere (probably a 256-bit set operation or something), and I imagine with some tinkering we might be able to broaden the accelerated code back out to all SSE4.1 chips.
I'm marking this fixed again under the assumption that the crash will not recur. Here's hoping!
Comment 32•5 years ago
We've encountered similar problems in HACL* and we tend to see these issues as a build problem.
If a file (any file) is compiled with -mavx (as the NSS build seems to do), then the compiler is free to select AVX instructions at any time, including for things like array initialization that don't even use intrinsics. (We've seen this...) From what I gather, the Hacl_Foo_128.c files are compiled with -mavx -msse4.1 -maes -m... .
So, run-time testing for AVX is one fix (ideally you should run-time test for all the instruction sets the file you're about to jump into was compiled with), but the other trivial fix is to just remove the -mavx flag when compiling Hacl_Foo_128.c.
From my testing on an old MacBook Pro with SSE4.1 but no AVX, it suffices to remove the -mavx flag when compiling Hacl_Foo_128.c and then you won't need the extra AVX run-time test.
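The suggestion to run-time test for every instruction set a file was compiled with can be sketched using the macros that GCC and Clang predefine for the corresponding `-m` flags (`__SSE4_1__`, `__AVX__`, `__AVX2__`). This is an illustration under those assumptions, not code from NSS or HACL*, and the function name `compiled_isa_is_supported` is hypothetical. It would sit in the same translation unit as the accelerated code, so the checks automatically track the flags that file is built with.

```c
#include <stdbool.h>

/* Returns true only if the running CPU supports every x86 extension this
 * translation unit was compiled to assume. On other compilers or
 * architectures nothing is checked and the function returns true. */
static bool compiled_isa_is_supported(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
#ifdef __SSE4_1__
    /* File was built with -msse4.1 (or a flag that implies it). */
    if (!__builtin_cpu_supports("sse4.1")) {
        return false;
    }
#endif
#ifdef __AVX__
    /* File was built with -mavx: the compiler may emit AVX anywhere. */
    if (!__builtin_cpu_supports("avx")) {
        return false;
    }
#endif
#ifdef __AVX2__
    if (!__builtin_cpu_supports("avx2")) {
        return false;
    }
#endif
#endif
    return true;
}
```

If `-mavx` is dropped from the file's build flags, `__AVX__` is no longer defined and the AVX check disappears with it, which mirrors the two alternatives described above.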
Comment 33•5 years ago
Please specify a root cause for this bug. See :tmaity for more information.
Comment 34•5 years ago
Root cause: a testing error. We did not have CI coverage for Pentium Gold and similar non-AVX processors.