Open Bug 1613236 Opened 5 years ago Updated 1 years ago

freebl: POWER poly1305 mac vector acceleration

Categories

(NSS :: Libraries, enhancement, P5)

Other
Unspecified
enhancement

Tracking

(Not tracked)

People

(Reporter: johnjmar, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

(Whiteboard: [nss-fx])

Attachments

(3 files, 3 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0

Expected results:

Would like to reach community to help improve the performance of Poly1305
using POWER8 and POWER9 ISA enhancements, and to provide proof that
achieved performance of Poly1305 is close to optimal for the platform
(i.e matching or beating OpenSSL performance *)

Optimized assembly implementations in the Cryptogams repository [1] may serve
as useful reference. C-based intrinsic implementations are also acceptable
given prior conditions (*).

Financial bounty upon completion, community acceptance of patches, and reviewer
input.

Moreover, free access to a ppc64le VM or direct Jenkins CI account can be
obtained at OSU Open Source Lab: https://osuosl.org/services/powerdev/
You can list me (johnjmar@linux.vnet.ibm.com) as the IBM Advocate.

[1] https://github.com/dot-asm/cryptogams/

Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P5
Hardware: Unspecified → Other

Hi. The original submitter of this bug, John Martinez, left IBM. For now, I'll be the designated contact from IBM.

Note that we decided to bring support for vectorization for the IBM platforms directly via HACL*.

B. Beurdouche within : https://bugzilla.mozilla.org/show_bug.cgi?id=1613235#c28

I don't think we would take the assembly for Poly1305, but fortunately no patch was provided and there is vectorization support in progress on the HACL* side, so I am sure that this code will be pretty fast...

related: https://github.com/project-everest/hacl-star/issues/404

@gcwilson, it looks like this issue (and the related bounty: https://www.bountysource.com/issues/88058247-freebl-power-poly1305-mac-vector-acceleration) are superseded by

issue: https://github.com/project-everest/hacl-star/issues/404
bounty: https://www.bountysource.com/issues/96585648-enable-build-system-and-optimize-libintvector-h-for-ppc64le-linux

If so, then this issue (and the related bounty) should be closable.

(@aoeuh, you may take a look at the hacl-star bounty, as you know, sometimes bounties are started but never finished, so possibly you can still solve-and-claim the hacl-star bounty.)

Hi Benjamin,
Bringing vectorized implementations from HACL* is good, specifically with the upcoming ppc64le support of libintvector.h would bring a considerable enhancement to the Poyl1305 mac on ppc64le. However, HACL* vectorization via libintvector.h uses one template to vectorize algorithms for all supported architectures using pre-defined C intrinsics which doesn't yield the same performance margin among supported architectures because the way out-of-order executution (OoOE) working differs from one architecture to another. There are cases in HACL* vectorized template where data alternates between vector and scalar which hold the performance a bit back on powerpc architecture. I made a benchmark of assembly implementation and HACL* vectorization on POWER9 to determine the performance gap, I got 1.45 cycles per byte for Poly1305 update function of native implementation and 3.88 cycles per byte for Poly1305 update function of HACL* vectorization. With that said, could you reconsider implementing Poly1305 mac the same way previous algorithms have been optimized to get the optimal performance on PowerPC architecture?

Hi Mamone,
The current goal of NSS for its low level crypto is to focus on memory safety, functional correctness and secret independence.
To achieve these goals, we are trying to remove as much code as possible that is handwritten. Especially, we try to reduce the duplication of code between architectures and focus on having a very limited number of implementations which can provide stronger guarantees and can be audited more easily and frequently than the current code. In that process, performance is not something we are looking at much, except to make sure we don't cause significant regressions.

For the specific case that you are mentioning, we already have high quality code for it in HACL* and the vectorization via libintvector.h would provide unoptimal, yet good enough performance for NSS. However, I think it would be interesting for you and George to discuss your observations with my HACL* co-authors [0] and see if they could generate a different version of the code which could directly leverage calls to the power assembly and intrinsics without the "manual" code duplication that we have in the Chacha20Poly1305 case [1].

[0] hacl-star-maintainers@lists.gforge.inria.fr
[1] https://searchfox.org/nss/source/lib/freebl/chacha20poly1305-ppc.c

Flags: needinfo?(bbeurdouche)
Flags: needinfo?(maamoun.tk)
Blocks: hacl-star
Flags: needinfo?(bbeurdouche)

Hi Benjamin,

I am from IBM and work with George Wilson. I am curious on the next plan for getting Poly1305 code into NSS for ppc64le.
I understand that the idea is to take the code from hacl-star. Do you think the code is now good enough there to pull from ? Or is there some development left in hacl-star before it can be taken into this ?

Thanks & Regards,
- Nayna

Redirect a needinfo that is pending on an inactive user to the triage owner.
:beurdouche, since the bug has recent activity, could you please find another way to get the information or close the bug as INCOMPLETE if it is not actionable?

For more information, please visit auto_nag documentation.

Flags: needinfo?(maamoun.tk) → needinfo?(bbeurdouche)

[PowerPC] Process leftovers and message length in poly1305 update

Attachment #9296134 - Attachment is obsolete: true

I pushed a patch to phabricator that enables 128-bit vector Poly1305 implementation from HACL* portion for PowerPC arch. It implies the following PowerPC-specific features.

  • Import vectorization directives from HACL* upstream to libintvector.h fork in freebl.
  • Enable 128-bit Poly1305 cores in GYP and Make build systems.
  • Utilize 128-bit poly1305 implementation and remove legacy 32-bit one.

Performance test (measuring Chacha20_Poly1305 encryption performance for a 16KB buffer, repeated 10000 times)

bltest -E -m chacha20_poly1305 -p 10000 -o /dev/null -v tests/chacha20_poly1305/iv0 -k tests/chacha20_poly1305/key0 -b 16384

(Upstream. VSX implementation of Chacha20/32-bit Poly1305 ported from HACL*)
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_poly1305_e 156Mb 256Kb     10T       0       0.000     360.000       0.360  434Mb 28Kb

(This patch. Same VSX implementation of Chacha20/128-bit Poly1305 ported from HACL*)
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_poly1305_e 156Mb 256Kb     10T       0       0.000     340.000       0.341 458Mb 216Kb

The patch passes all relative tests that in tests/all.sh on POWER9.

Severity: normal → S3

Hi all. Is there anything we can to to facilitate review of Mamone's work?

Flags: needinfo?(rrelyea)

Hi,

what's the difference between the two patches?

Hi,

I haven't figured out yet how to push commits on top of existing patch so I have a standalone patch for every update . You need to ignore the old patch that marked as legacy and consider the last one.

Phabricator can handle new updates, you just need the phabricator link in the comment of the commit, and use the mozphap command you use to create the initial patch. I use hg 'q' interface to create a single commit that keeps the same commit comment (Phabricator automatically adds the phabricator link to your commit comment.). If you use git, I believe you can use the git squash to merge your commit. You just need to preserve the phabricator link from the initial commit in your new squashed commit.

When you push a patch, be sure to update the reviewers and request a review.

Flags: needinfo?(rrelyea)

Is there any update on this issue?

Flags: needinfo?(nkulatova)

Hi,

we are currently working on the chain of patches that update the currently supported HACL* code. One of the patches also update the libintvector.h library. So, let's do the following:

  1. we will try to have the patches submitted as soon as we can
  2. I would ask you to rebase the code to support the modifications
  3. I proceed with this patch.

Does it work for you?

Flags: needinfo?(nkulatova)

It does. You can let me know once the patches are merged into the mainstream so I can rebase the code.

Flags: needinfo?(bbeurdouche)
Whiteboard: [nss-fx]

@nkulatova Have you got any progress on updating HACL* code?

Flags: needinfo?(nkulatova)

Actually, yes. The patch has just been landed.

Flags: needinfo?(nkulatova)
Attachment #9296171 - Attachment is obsolete: true

Thank you for updating HACL* base code. I updated the patch to remove HACL* import part as well it originally cuts a fair amount of legacy code that imported improperly from HACL*, reviewing the changes is pretty straightforward with the current shape.

Flags: needinfo?(nkulatova)

Lovely, looking forward to reviewing the code!

Flags: needinfo?(nkulatova)
Attachment #9329907 - Attachment is obsolete: true
Attachment #9329906 - Attachment is obsolete: true
Attachment #9329491 - Attachment is obsolete: true
Attachment #9329491 - Attachment is obsolete: false
Attachment #9329906 - Attachment is obsolete: false

Hi, could you precise which out of 4 patches we need to take a look at? All 4? the latest?

Flags: needinfo?(maamoun.tk)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: