Closed Bug 1550579 Opened 6 years ago Closed 5 years ago

On Fenix (Arm32) ec_Curve25519_mul() takes a surprising amount of time

Categories

(NSS :: Libraries, defect, P1)

ARM
Android

Tracking

(Performance Impact: high, firefox-esr60 wontfix, firefox-esr68 wontfix, firefox66 wontfix, firefox67 wontfix, firefox67.0.1 wontfix, firefox68 wontfix, firefox69 fixed)

RESOLVED FIXED

People

(Reporter: jesup, Assigned: kjacobs)

References

(Blocks 1 open bug)

Details

(Keywords: perf, perf:pageload, Whiteboard: [geckoview:fenix:p2] [bcs:p1])

Attachments

(1 file)

When profiling Fenix pageloads on a Motorola G5 (Cortex-A53 octacore), generating keys on the master process's Socket Thread seems to take an inordinately long time, most of it in ec_Curve25519_mul().

Note that this is 32-bit ARM code, even though the processor can run AArch64.

See https://perfht.ml/2Vwm1MJ and https://perfht.ml/2VnphJY (Gecko Profiler capture of nytimes.com, 4ms sampling) and https://perfht.ml/2VsbsKc (captured with simpleperf -- this implies that mul() and square() are calling into the kernel, which is hard to verify with a Gecko Profiler profile, due to how profiles are captured). In any case, though, it's spending a long time generating keys when loading pages (cold).

Whiteboard: [qf][gv] → [qf][geckoview]

Before I looked, my thought was that we were hitting one of these two problems:

  1. Lots of connections. We can't really fix this, but it's worth looking into how many we have (and why). We might be making extra connections, and we could maybe dial back our aggressiveness a little on mobile. The site does appear to be using HTTP/2, so we're not hitting the HTTP/1.1 connection spread; wins might not be easy in this area. We still seem to have 120 connections, though...

  2. Too many key generation calls. I thought at first this might have been https://searchfox.org/mozilla-central/rev/b9da45f63cb567244933c77b2c7e827a057d3f9b/security/manager/ssl/nsNSSIOLayer.cpp#2433 which causes us to generate two key shares for each connection, but I only see 4ms/128ms heading that way in the profile I checked.

But Randell is probably right to attribute this to something in the multiply function. We are probably hitting the slow 32-bit code; 25519 without a 64-bit multiply is pretty bad. There might be a build configuration we can hit that will result in deploying faster 64-bit code on this platform, or maybe we should find a way to ship both sets of code and switch between them at runtime.
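(A minimal sketch, not anything in the tree, of what runtime switching could look like on 32-bit ARM Linux/Android: use getauxval() to pick between a portable limb multiply and a NEON one. fe_mul_portable/fe_mul_neon are hypothetical placeholders for the two code paths.)

```c
#include <stdint.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

/* Hypothetical implementations; only the dispatch logic is sketched here. */
void fe_mul_portable(uint32_t *out, const uint32_t *a, const uint32_t *b);
void fe_mul_neon(uint32_t *out, const uint32_t *a, const uint32_t *b);

typedef void (*fe_mul_fn)(uint32_t *, const uint32_t *, const uint32_t *);

static fe_mul_fn
select_fe_mul(void)
{
    /* The kernel sets HWCAP_NEON in the auxiliary vector when NEON is available. */
    if (getauxval(AT_HWCAP) & HWCAP_NEON) {
        return fe_mul_neon;
    }
    return fe_mul_portable;
}
```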

Summary: On Fenix ec_Curve25519_mul() takes a surprising amount of time → On Fenix (Arm32) ec_Curve25519_mul() takes a surprising amount of time

Regarding (1): in https://perfht.ml/2VrxRrn we're seeing lots of time there, with only about 18 connections in use at most (and fewer being started...) across 9 different domains, per the Network tab.

32-bit code for this is here: https://searchfox.org/mozilla-central/source/security/nss/lib/freebl/ecl/curve25519_32.c#109
A 32x32=64 multiply should be easy: 3 multiplies and 2 adds (one of which can be combined with a multiply), i.e. MUL, MLA, UMULL, ADD. A 64x64=128 multiply, which is probably what's needed here, is tougher. The code above doesn't appear to even try to use 32x32=64, though; it sticks to pure int32s.
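(Illustrative sketch, not the NSS code: the kind of widening multiply the 32x32=64 suggestion amounts to, which a compiler can lower to UMULL/UMLAL on Arm32.)

```c
#include <stdint.h>

/* Multiply two 32-bit limbs into a 64-bit product. On Arm32 a reasonable
 * compiler emits a single UMULL for this rather than a library call. */
static inline uint64_t
mul32x32_64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* Accumulate one column of a schoolbook limb multiplication, keeping the
 * running sum in 64 bits so carries aren't lost; this maps onto UMLAL. */
static inline uint64_t
mac32(uint64_t acc, uint32_t a, uint32_t b)
{
    return acc + mul32x32_64(a, b);
}
```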

Per http://conradoplg.cryptoland.net/files/2010/12/gcm14.pdf it's possible to build 64x64=128 multiplies with VMULL in NEON on ARMv7 (8 VMULLs) and ARMv8 (1 VMULL). See also https://cryptojedi.org/peter/data/ches-20120911.pdf (paper: https://cryptojedi.org/papers/neoncrypto-20120320.pdf)
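(Again just a sketch, not NSS code: the NEON building block those papers lean on. VMULL multiplies two pairs of 32-bit lanes into two 64-bit products in one instruction, which is what a vectorized multi-limb field multiply would be built around.)

```c
#include <arm_neon.h>
#include <stdint.h>

/* Compute { a0*b0, a1*b1 } as two 64-bit lanes with a single VMULL. */
static inline uint64x2_t
two_products(uint32_t a0, uint32_t a1, uint32_t b0, uint32_t b1)
{
    uint32x2_t a = vset_lane_u32(a1, vdup_n_u32(a0), 1);
    uint32x2_t b = vset_lane_u32(b1, vdup_n_u32(b0), 1);
    return vmull_u32(a, b);
}
```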

Would there be any advantage on slow machines (like arm32, perhaps low-end desktops?) in pre-generating a small pool of keys off idle callbacks? Are there any security gotchas here? (Stab in the dark; probably a fair bit of work, but less than trying to recode and verify everything in NEON.)

It might even help higher-end/desktop machines (a bit).

Flags: needinfo?(mt)

It seems like there are some low-hanging fruit when it comes to optimization of this code. I would prefer to chase them down first. Caching keygen isn't going to save us from the use of keys, which is just as expensive. And working out when to run those routines could be tricky.

We don't build on systems that don't have uint64_t any more, so we could look at the 32x32=64 optimizations. I agree that chasing down uint128_t is probably going to be more expensive. It all comes down to how much effort you are willing to expend.

Flags: needinfo?(mt)

From looking at a number of these, it seems like each handshake generation takes on the order of 35ms, and loading a page may require doing this many times.

Over the ~13 seconds for the entire load, the Socket Thread spent almost 900ms in Curve25519_pt_mul.

jcj: what are our options here? It looks like this has a serious impact on pageload on many devices.

Flags: needinfo?(jjones)
Whiteboard: [qf][geckoview] → [qf:p1:pageload][geckoview]
OS: Unspecified → Android
Hardware: Unspecified → ARM
Whiteboard: [qf:p1:pageload][geckoview] → [qf:p1:pageload][geckoview:fenix:p2]

This seems wholly reasonable to me, too. I've double-checked and there's no HACL* work for a 32-bit implementation, sadly, but the optimizations here don't look too hard.

I'm passing the ni? to Thyla for prioritization.

Flags: needinfo?(jjones) → needinfo?(tvandermerwe)
Assignee: nobody → kjacobs.bugzilla
Flags: needinfo?(tvandermerwe)
Status: NEW → ASSIGNED

The priority flag is not set for this bug.
:jcj, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jjones)
Priority: -- → P2

I don't know how much help this is likely to be, but this implementation might be worth looking into: https://github.com/briansmith/ring/blob/master/crypto/curve25519/asm/x25519-asm-arm.S

Flags: needinfo?(jjones)
Priority: P2 → P1

The multiplication change is not as simple as I was hoping... Unrolling the multiplication gives a fairly minor (~20%) speedup on my host machine. I've also tested two other implementations: Ref10 and Curve25519-donna. Both are drop-in replacements with significant improvements, the former being 10-15x faster in my tests.
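(For context, this is roughly the shape of harness used for such comparisons, not the actual test code; x25519_scalarmult() is a hypothetical stand-in for whichever implementation is being timed.)

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the scalar multiplication under test. */
void x25519_scalarmult(uint8_t out[32], const uint8_t scalar[32],
                       const uint8_t point[32]);

int
main(void)
{
    uint8_t out[32] = { 0 }, scalar[32] = { 9 }, point[32] = { 9 };
    struct timespec t0, t1;
    const int iters = 1000;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        x25519_scalarmult(out, scalar, point);
        scalar[0] ^= out[0]; /* keep the loop from being optimized away */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%d scalar mults in %.2f ms (%.3f ms each)\n", iters, ms, ms / iters);
    return 0;
}
```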

This was discussed in some detail back when the current implementation went in: https://bugzilla.mozilla.org/show_bug.cgi?id=957105#c40. Given these factors, and the fact that we have no HACL* implementation available, are there any concerns with replacing our 32-bit implementation with Ref10?

Flags: needinfo?(mt)

If we can get Ref10 on reasonable licensing terms (please confirm with mhoye), then I don't see a real problem with taking it for 32-bit machines.

I think that we need to be a little less picky about things when there is a 10x speedup on offer, but we do need to ensure that the licensing is kosher.

Flags: needinfo?(mt)

This patch changes the 32-bit Curve25519 implementation in NSS from ref to ref10. There is no HACL* variant available for 32-bit platforms. A helpful benchmark comparison is here: https://bench.cr.yp.to/impl-scalarmult/curve25519.html

**Notes about this patch:**

  1. lib/freebl/freebl_base.gypi is updated to replace the 64-bit HACL* implementation with this version. This is ONLY for testing on a wider number of platforms and sanitizers. It will be removed before this is marked ready to land.
  2. mhoye has given initial clearance on the licensing (being derived from public domain code), but must approve here before it can land.
  3. This needs to be tested more fully to ensure we don't have subtle incompatibilities. Conversations are ongoing in //#nss-arm32-perf// on how best to accomplish this (since our CI doesn't target arm32 currently).
  4. Performance on a macOS 64-bit optimized build improves our x25519 gtest performance (~90 shared secret computations) from ~90ms to ~8ms. Performance improvements will be larger for workloads that use the implementation more heavily or have less setup/teardown included in the measurement.
  5. LibSodium also implements ref10. Their adaptation checks the public key against a blacklist at time of use ([[ https://github.com/jedisct1/libsodium/blob/master/src/libsodium/crypto_scalarmult/curve25519/ref10/x25519_ref10.c | here ]]). This blacklist is a subset of what we already check against in [[ https://searchfox.org/mozilla-central/source/security/nss/lib/freebl/ecl/ecp_25519.c#27 | ec_Curve25519_pt_validate ]]. A structural sketch of that kind of check follows below.
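(A structural sketch of the blacklist check, not the NSS or libsodium code; the actual point values are deliberately omitted, see ec_Curve25519_pt_validate for the real table. The loop avoids early exits so timing doesn't depend on the data.)

```c
#include <stddef.h>
#include <stdint.h>

/* Known low-order points, 32 bytes each. Placeholder only; the real
 * values live in ec_Curve25519_pt_validate and are not reproduced here. */
static const uint8_t kBlacklist[][32] = {
    { 0 }, /* ... */
};

/* Returns nonzero if pt matches a blacklisted point. No early exit, so
 * the time taken does not depend on which (if any) entry matched. */
static int
pt_is_blacklisted(const uint8_t pt[32])
{
    int found = 0;
    for (size_t i = 0; i < sizeof(kBlacklist) / sizeof(kBlacklist[0]); i++) {
        uint8_t diff = 0;
        for (size_t j = 0; j < 32; j++) {
            diff |= pt[j] ^ kBlacklist[i][j];
        }
        found |= (diff == 0);
    }
    return found;
}
```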

See this startup-and-load-CNN profile as well: https://perfht.ml/2QNWAAv (same issue, just a cleaner profile; the simpleperf import function has a bugfix, as it was mis-drawing samples in the graph in some cases).

A preliminary look with the patch applied appears to show significantly less overhead for 25519: https://perfht.ml/2QOdi2t

Attachment #9069810 - Attachment description: Bug 1550579 - Replace arm32 curve25519 ref implementation with ref10 → Bug 1550579 - Replace arm32 curve25519 ref implementation with fiat-crypto

esr68=affected because we might want to uplift this optimization to Fennec ESR 68.1.

Whiteboard: [qf:p1:pageload][geckoview:fenix:p2] → [qf:p1:pageload][geckoview:fenix:p2] [bcs:p1]
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.45
Blocks: 1563279
Blocks: 1563280

status-firefox69=fixed by NSS 3.45 in bug 1550889.
status-firefox-esr68=wontfix because we don't plan to uplift these NSS changes to Fennec ESR 68.

Performance Impact: --- → P1
Keywords: perf:pageload
Whiteboard: [qf:p1:pageload][geckoview:fenix:p2] [bcs:p1] → [geckoview:fenix:p2] [bcs:p1]
