Closed Bug 1605369 Opened 5 years ago Closed 5 years ago

Crash in [@ Hacl_Chacha20Poly1305_128_aead_encrypt] & [@ Hacl_Chacha20Poly1305_128_aead_decrypt]

Categories

(NSS :: Libraries, defect, P1)

Tracking

(Root Cause: Testing Error: Other, firefox-esr68 unaffected, firefox71 unaffected, firefox72 unaffected, firefox73 fixed, firefox74 blocking fixed)

RESOLVED FIXED
Tracking Status
firefox-esr68 --- unaffected
firefox71 --- unaffected
firefox72 --- unaffected
firefox73 --- fixed
firefox74 blocking fixed

People

(Reporter: marcia, Assigned: franziskus)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(3 files)

This bug is for crash report bp-6be0570e-6b86-4d85-928e-a696e0191220.

Seen while looking at crash stats; the crashes started in build 20191220095035:

Possible regression window based on build id: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8e1b11b00157269f1f781753efc241e005efeaf1&tochange=5a7641c75a2e3d54864c23242f2b986c7ec77796

Bug 1602020? ni on J.C.

Top 10 frames of crashing thread:

0 freebl3.dll Hacl_Chacha20Poly1305_128_aead_encrypt security/nss/lib/freebl/verified/Hacl_Chacha20Poly1305_128.c:1212
1 freebl3.dll ChaCha20Poly1305_Seal security/nss/lib/freebl/chacha20poly1305.c:170
2 softokn3.dll sftk_ChaCha20Poly1305_Encrypt security/nss/lib/softoken/pkcs11c.c:710
3 softokn3.dll NSC_Encrypt security/nss/lib/softoken/pkcs11c.c:1509
4 nss3.dll PK11_Encrypt security/nss/lib/pk11wrap/pk11obj.c:979
5 nss3.dll ssl3_ChaCha20Poly1305 security/nss/lib/ssl/ssl3con.c:1845
6 nss3.dll ssl3_SendRecord security/nss/lib/ssl/ssl3con.c:2586
7 nss3.dll ssl3_FlushHandshake security/nss/lib/ssl/ssl3con.c:2797
8 nss3.dll ssl3_SendFinished security/nss/lib/ssl/ssl3con.c:11555
9 nss3.dll ssl3_SendClientSecondRound security/nss/lib/ssl/ssl3con.c:7977

Flags: needinfo?(jjones)

I can reproduce the crash by clicking the Wikipedia tile under Top Sites on the New Tab page.

Flags: needinfo?(franziskuskiefer)
Regressed by: hacl-ci-2
Has Regression Range: --- → yes
Crash Signature: [@ Hacl_Chacha20Poly1305_128_aead_encrypt] → [@ Hacl_Chacha20Poly1305_128_aead_encrypt] [@ Hacl_Chacha20Poly1305_128_aead_decrypt ]

This looks like the CPU isn't supporting the instruction (but I don't see a specific CPU that could cause this in the crash reports) or is somehow broken. I'd disable this on Windows for now and add additional hardware checks.

Assignee: nobody → nobody
Component: Security: PSM → Libraries
Flags: needinfo?(franziskuskiefer)
Product: Core → NSS
QA Contact: jjones
Target Milestone: --- → 3.49
Version: Trunk → trunk
Assignee: nobody → franziskuskiefer
Status: NEW → ASSIGNED
Severity: normal → critical
OS: Windows 10 → Windows
Hardware: Unspecified → Desktop
Summary: Crash in [@ Hacl_Chacha20Poly1305_128_aead_encrypt] → Crash in [@ Hacl_Chacha20Poly1305_128_aead_encrypt] & [@ Hacl_Chacha20Poly1305_128_aead_decrypt]

Adding a macOS signature and changing platform to all.

Crash Signature: [@ Hacl_Chacha20Poly1305_128_aead_encrypt] [@ Hacl_Chacha20Poly1305_128_aead_decrypt ] → [@ Hacl_Chacha20Poly1305_128_aead_encrypt] [@ Hacl_Chacha20Poly1305_128_aead_decrypt ] [@ Hacl_Chacha20_Vec128_chacha20_encrypt_128 ]
OS: Windows → All
Hardware: Desktop → All

These crashes are happening on older CPUs, and our initial patch in this bug to just check SSE2/SSE3 is not going to solve the issue. All the crashing systems report having SSE2 and SSE3 -- and SSE 4.1, even.

It looks like crashing computers that are reasonably easy to identify -- at least in the Mac family -- are a 2010 MacPro with a Xeon "Bloomfield" processor, and a 2008 unibody MacBook Pro using a "Penryn" Core2 Duo processor.

I actually have a Penryn-based MacBook Pro in storage, so that is probably where I'll go to debug this properly. But given the lateness of the hour and the proximity of the holidays, rather than attempt the patch in this bug, I'm going to, with a heavy heart, back out Bug 1574643's patches and re-uplift. We'll have to try again in the next Firefox.

Flags: needinfo?(jjones)

The backout of the regressing changes is on autoland now:
https://hg.mozilla.org/integration/autoland/rev/0450c6ad5b85

At this point I think the HACL* updates will need to slip to NSS 3.50 / Firefox 74.

Thanks J.C. - I did note that this also affected Fenix, even though we are not able to show that in the crash stats table - see https://crash-stats.mozilla.org/report/index/290c74d6-4ad1-4e12-b8fc-ddf2b0191223 for a sample crash.

The priority flag is not set for this bug.
:jcj, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jjones)

This was fixed by backout in bug 1574643 comment 4.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(jjones)
Resolution: --- → FIXED

(In reply to Ryan VanderMeulen [:RyanVM] from comment #11)

This was fixed by backout in bug 1574643 comment 4.

Unfortunately this startup crash is now back in the latest Nightly version, as built from https://hg.mozilla.org/mozilla-central/rev/5ba39736e74b8a072a63ee215545f89d5c2ec8c8, since the change apparently re-landed in https://bugzilla.mozilla.org/show_bug.cgi?id=1574643#c5

Can we please get that backed out again, and a new Nightly version built, please :-)

I really need to know explicit CPUs that are hitting this crash; the crash stats seemed to point to early SSE4 (Penryn) microarchitecture, which appears to all be fixed by gating the feature on also having AVX.

I'll prepare a backout once I have time later this morning.

Flags: needinfo?(jjones)

Also, please point me to a crash link for this?

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached file cpu.txt (deleted) —

(In reply to J.C. Jones [:jcj] (he/him) from comment #13)

I really need to know explicit CPUs that are hitting this crash;

I'm attaching a (redacted) report from the "CPUID CPU-Z" program, which hopefully contains enough information to be useful. (If you need the full report, I'd be happy to provide that privately.)

I have a Core 2 Duo E8400, which is based on the Penryn microarchitecture. Because I didn't run into this crash in December, I used mozregression to check, and I was affected by it then as well.

Thanks, both. It looks like basically the fix was incomplete, but that our gating-on-AVX is the correct fix, we just missed a path that isn't always hit. Except it's always hit for your cases -- sorry about that! Incomplete testing.

(My Penryn MacBook isn't updated enough to test a whole Firefox build, so the tests were too fine-grained and didn't hit the erroneous flow the way a full Firefox does. I'm upgrading it to El Capitan right now for verification before re-landing yet again.)
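
As a rough illustration of the AVX gating discussed above, the sketch below routes every entry point through one shared feature check, so that no call path can reach the vectorized code on a CPU without AVX. This is not NSS code, which is written in C. The function names, the required-flag set, and the returned labels are hypothetical, and the flag names follow the Linux /proc/cpuinfo spelling.

```python
# Hypothetical model of the dispatch; the names here are illustrative only.
# Assumption taken from this thread: the 128-bit vectorized code is only
# safe when the CPU reports both SSE4.1 and AVX.
REQUIRED_FOR_VEC128 = frozenset({"sse4_1", "avx"})


def parse_flags(cpuinfo_flags_line):
    """Turn a 'flags : fpu vme ...' line into a set of feature names."""
    return frozenset(cpuinfo_flags_line.split(":", 1)[-1].split())


def pick_implementation(cpu_flags):
    """Single gate: every entry point asks here before choosing a code path."""
    if REQUIRED_FOR_VEC128 <= cpu_flags:
        return "vec128"
    return "portable"


def aead_encrypt_path(cpu_flags):
    # Encrypt and decrypt share the gate, so neither path can skip it.
    return pick_implementation(cpu_flags)


def aead_decrypt_path(cpu_flags):
    return pick_implementation(cpu_flags)


# Made-up flag lines: one CPU with SSE4.1 only, one that also has AVX.
sse41_only = parse_flags("flags : sse2 pni ssse3 sse4_1")
with_avx = parse_flags("flags : sse2 pni ssse3 sse4_1 sse4_2 avx")

print(aead_encrypt_path(sse41_only), aead_decrypt_path(sse41_only))
print(aead_encrypt_path(with_avx), aead_decrypt_path(with_avx))
```

Keeping the check in one helper, rather than repeating it at each call site, is one way to avoid a missed path of the kind described in this comment.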

Flags: needinfo?(jjones)
Attached file cat /proc/cpuinfo (deleted) —

I also get this crash, but this seems to be a different microarchitecture, Kaby Lake (Intel(R) Pentium(R) CPU 4415U).

I tested the try build on Linux x86-64, connecting to and using wikipedia.org works (also via search keyword), browser test on ssllabs.com shows no problems with ChaCha suites. Anything else that might be helpful?

Thanks, Viktor! The most important thing would be to verify that the ciphersuite used on Wikipedia is actually exercising ChaCha20. You can do this by clicking the padlock in the address bar > Right arrow > More Information. "CHACHA20" should be somewhere in the ciphersuite string under "Technical Details". You could also try the parent build, which should crash.

We're still waiting on OS updates to our machine that has one of these CPUs. Once done we'll go through the same steps.

(In reply to Kevin Jacobs [:kjacobs] from comment #23)

The most important thing would be to verify that the ciphersuite used on Wikipedia is actually exercising ChaCha20. You can do this by clicking the padlock in the address bar > Right arrow > More Information. "CHACHA20" should be somewhere in the ciphersuite string under "Technical Details".

I actually did this to verify it. I also wanted to verify this on google.com, but I don't get ChaCha20 by default there, so I disabled TLS 1.3 (I don't know how to force the use of ChaCha20 with TLS 1.3 in Firefox) and set "security.ssl3.ecdhe_ecdsa_aes_128_gcm_sha256" to "false", and then tried to find out the ciphersuite and unfortunately got the crash from bug 1609144. So I had to check via the developer tools that the connection indeed used ChaCha20.

(In reply to Tom Schuster [:evilpie] from comment #20)

Created attachment 9121122 [details]
cat /proc/cpuinfo

I also get this crash, but this seems to be a different microarchitecture, Kaby Lake (Intel(R) Pentium(R) CPU 4415U).

This is fascinating; Kaby Lake has AVX, but that particular CPU does not, and your cpuinfo concurs (of course): https://ark.intel.com/content/www/us/en/ark/products/96508/intel-pentium-gold-processor-4415u-2m-cache-2-30-ghz.html

I've opened Bug 1609569 to track fixing at least the Pentium Gold-series processors that have SSE4 but no AVX.

https://phabricator.services.mozilla.com/D60086 has the new uplift to m-c with the changes from Comment 21.

Possibly related https://bugzilla.mozilla.org/show_bug.cgi?id=1609492

The cpu on this desktop running nightly 74.0a1 is an Intel Pentium CPU G4500 @ 3.50GHz where the /proc/cpuinfo claims :

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust erms invpcid rdseed smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp

Strangely an Intel Core i7-4810MQ CPU @ 2.80GHz seems to run nightly just fine. On the i7 machine I see :

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts

I needed some awk/sed/grep-fu on these to get the diff:

3dnowprefetch art avx avx2 bmi1 bmi2 clflushopt epb 
f16c fma hwp hwp_act_window hwp_epp hwp_notify 
ibpb ibrs ida intel_pt rdseed smx stibp 
tsc_known_freq xgetbv1 xsavec xsaves

Those above items are unique between the cpu types but the Intel i7 runs nightly fine :

https://hg.mozilla.org/mozilla-central/rev/c35bb210b8ae793c844bd94c1848d246bf601293

The Intel Pentium G4500 crashes repeatedly per 1609492.

--
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
GreyBeard and suspenders optional
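
A flag comparison like the one in the comment above can also be done with a few lines of Python instead of awk/sed/grep. This is only a sketch: the helper names are made up, and the two input strings are shortened excerpts of the flag lines quoted above, not the full lists.

```python
def flag_set(cpuinfo_flags_line):
    """Split a 'flags : a b c' line from /proc/cpuinfo into a set of names."""
    return set(cpuinfo_flags_line.split(":", 1)[-1].split())


def flag_diff(line_a, line_b):
    """Return (flags only in A, flags only in B), each sorted alphabetically."""
    a, b = flag_set(line_a), flag_set(line_b)
    return sorted(a - b), sorted(b - a)


# Shortened excerpts of the two flag lines quoted in the comment above.
pentium_g4500 = "flags : sse4_1 sse4_2 aes xsave rdrand rdseed"
core_i7_4810mq = "flags : sse4_1 sse4_2 aes xsave avx f16c rdrand avx2"

only_pentium, only_i7 = flag_diff(pentium_g4500, core_i7_4810mq)
print("only on the Pentium G4500:", only_pentium)
print("only on the Core i7:", only_i7)
```

Reporting the two directions separately makes it easier to see which features the crashing CPU lacks (here avx, avx2 and f16c) than a single merged list does.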

The priority flag is not set for this bug.
:jcj, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jjones)

@Dennis Clarke:
Wikipedia [1] says (without source) about Intel CPUs with AVX support: "Not all CPUs from the listed families support AVX. Generally, CPUs with the commercial denomination "Core i3/i5/i7" support them, whereas "Pentium" and "Celeron" CPUs don't."

This is probably the reason for what you describe in bug 1609492, which looks like a duplicate of this bug. I also got this crash when I tried to connect to Wikipedia, and Wikipedia is the only site that I know for which I get ChaCha20 encryption by default (i.e. without changing TLS related settings in about:config).

[1] https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX

The offending code, with further AVX checks, will be in the next Firefox Nightly (https://hg.mozilla.org/mozilla-central/rev/ef97ef459394). And I agree that bug 1609492 is a duplicate.

Thanks for running down the CPU differences there Dennis, that's helpful to confirm our suspicion. While debugging in Nightly is a terrible thing, we are reasonably sure we got everything this time.

Note that I've opened Bug 1609569 to find a path to avoid needing AVX here. It's curious because inspection of the relevant code will show we use no AVX instructions -- the optimizer is adding some somewhere --- probably a 256-bit set operation or something --- and I imagine with some tinkering we might be able to broaden the accelerated code back out to all SSE4.1 chips.

I'm marking this fixed again under the assumption that the crash will not recur. Here's hoping!

Status: REOPENED → RESOLVED
Closed: 5 years ago → 5 years ago
Flags: needinfo?(jjones)
Priority: -- → P1
Resolution: --- → FIXED

We've encountered similar problems in HACL* and we tend to see these issues as a build problem.

If a file (any file) is compiled with -mavx (as the NSS build seems to do), then the compiler is free to select AVX instructions at any time, including things like array initialization that don't even use intrinsics. (We've seen this...) From what I gather, the Hacl_Foo_128.c files are compiled with -mavx -msse4.1 -maes -m....

So, run-time testing for AVX is one fix (ideally you should run-time test for all the instruction sets the file you're about to jump into was compiled with), but the other trivial fix is to just remove the -mavx flag when compiling Hacl_Foo_128.c.

From my testing on an old Macbook Pro with SSE4.1 but no AVX, it suffices to remove the -mavx flag when compiling Hacl_Foo_128.c and then you won't need the extra AVX run-time test.
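
The rule of thumb above (test at run time for every instruction set a file was compiled with) can be sketched as a small check. This is a hypothetical model rather than anything from the NSS or HACL* build: the mapping table and function names are assumptions, the compiler options are the GCC/Clang spellings mentioned in this comment, and the CPU flag names follow /proc/cpuinfo.

```python
# Assumed mapping from GCC/Clang -m options to /proc/cpuinfo flag names.
COMPILER_FLAG_TO_CPU_FLAG = {
    "-msse4.1": "sse4_1",
    "-mavx": "avx",
    "-maes": "aes",
}


def required_cpu_flags(cflags):
    """CPU features the compiler may rely on, given a compile command line."""
    return {
        COMPILER_FLAG_TO_CPU_FLAG[opt]
        for opt in cflags.split()
        if opt in COMPILER_FLAG_TO_CPU_FLAG
    }


def safe_to_call(cflags, cpu_flags):
    """True only if the CPU has every feature the file was compiled for."""
    return required_cpu_flags(cflags) <= set(cpu_flags)


# Hypothetical CPU with SSE4.1 but no AVX.
cpu = {"sse2", "ssse3", "sse4_1"}

print(safe_to_call("-O2 -msse4.1 -mavx", cpu))  # built with -mavx
print(safe_to_call("-O2 -msse4.1", cpu))        # -mavx removed
```

Both fixes named in the comment show up here: either the run-time check covers avx as well, or the file is built without -mavx so that avx is no longer in the required set.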

Please specify a root cause for this bug. See :tmaity for more information.

Root Cause: --- → ?

The root cause is a testing error: we did not have CI coverage for Pentium Gold and similar non-AVX processors.

Root Cause: Testing Error → ---
Root Cause: --- → Testing Error