NTT Multiplication for NTT-unfriendly Rings: New Speed Records for Saber and NTRU on Cortex-M4 and AVX2

Chi-Ming Marvin Chung; Vincent Hwang; Matthias J. Kannwischer; Gregor Seiler; Cheng-Jhih Shih; Bo-Yin Yang

doi:10.46586/tches.v2021.i2.159-188

Authors

Chi-Ming Marvin Chung (Academia Sinica, Taipei, Taiwan; National Taiwan University, Taipei, Taiwan)
Vincent Hwang (Academia Sinica, Taipei, Taiwan; National Taiwan University, Taipei, Taiwan)
Matthias J. Kannwischer (Max Planck Institute for Security and Privacy, Bochum, Germany)
Gregor Seiler (IBM Research – Zurich, Rüschlikon, Switzerland; ETH Zurich, Zurich, Switzerland)
Cheng-Jhih Shih (Academia Sinica, Taipei, Taiwan; National Taiwan University, Taipei, Taiwan)
Bo-Yin Yang (Academia Sinica, Taipei, Taiwan)

Keywords:

Polynomial Multiplication, NTT Multiplication, Saber, NTRU, Cortex-M4, AVX2

Abstract

In this paper, we show how multiplication in the polynomial rings used by the NIST PQC finalists Saber and NTRU can be implemented efficiently using the number-theoretic transform (NTT). We obtain superior performance compared to the previous state-of-the-art implementations using Toom–Cook multiplication on both of NIST’s primary software optimization targets, AVX2 and Cortex-M4. Interestingly, these two platforms require different approaches: on the Cortex-M4, we use 32-bit NTT-based polynomial multiplication, while on Intel we use two 16-bit NTT-based polynomial multiplications and combine the products using the Chinese Remainder Theorem (CRT).
For Saber, the performance gain is particularly pronounced. On Cortex-M4, the Saber NTT-based matrix-vector multiplication is 61% faster than the Toom–Cook multiplication, resulting in 22% fewer cycles for Saber encapsulation. For NTRU, the speed-up is smaller, but NTT-based multiplication still performs better than Toom–Cook for all parameter sets on Cortex-M4. The NTT-based polynomial multiplication for NTRU-HRSS is 10% faster than Toom–Cook, which results in a 6% cost reduction for encapsulation. On AVX2, we obtain speed-ups for three out of four NTRU parameter sets.
As a further illustration, we also include code for AVX2 and Cortex-M4 for LAC, the Chinese Association for Cryptologic Research competition award winner (also a NIST round 2 candidate), which outperforms existing code.
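
The two-modulus approach mentioned in the abstract can be illustrated with a small Python sketch. It is not the authors' optimized code. It assumes a Saber-like ring Z_q[x]/(x^n + 1) with q a power of two, one operand with small coefficients, and two NTT-friendly primes (7681 and 10753, chosen here for illustration) whose product exceeds twice the largest possible integer coefficient of the product. All function names are hypothetical. The idea is to compute the product exactly over the integers via an NTT modulo each prime, recombine with the CRT, and only then reduce modulo q.

```python
def find_psi(n, p):
    """Find a primitive 2n-th root of unity mod p (n a power of two)."""
    assert (p - 1) % (2 * n) == 0, "p must be 1 mod 2n"
    for g in range(2, p):
        psi = pow(g, (p - 1) // (2 * n), p)
        if pow(psi, n, p) == p - 1:  # psi^n = -1, so psi has order 2n
            return psi
    raise ValueError("no suitable root found")


def ntt(a, root, p):
    """Recursive radix-2 NTT: evaluate a at the powers of `root` mod p."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], root * root % p, p)
    odd = ntt(a[1::2], root * root % p, p)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % p
        out[k] = (even[k] + t) % p
        out[k + n // 2] = (even[k] - t) % p
        w = w * root % p
    return out


def negacyclic_mul_mod_p(a, b, p):
    """Multiply a and b in Z_p[x]/(x^n + 1) using an NTT."""
    n = len(a)
    psi = find_psi(n, p)
    psi_inv = pow(psi, -1, p)
    omega = psi * psi % p
    # Twist by powers of psi so that a cyclic NTT yields a negacyclic product.
    fa = ntt([x * pow(psi, i, p) % p for i, x in enumerate(a)], omega, p)
    fb = ntt([x * pow(psi, i, p) % p for i, x in enumerate(b)], omega, p)
    fc = [x * y % p for x, y in zip(fa, fb)]
    tc = ntt(fc, pow(omega, -1, p), p)  # inverse transform, still unscaled
    n_inv = pow(n, -1, p)
    return [x * n_inv % p * pow(psi_inv, i, p) % p for i, x in enumerate(tc)]


def crt_centered(r1, r2, p1, p2):
    """Combine residues mod p1 and p2 into a value in (-p1*p2/2, p1*p2/2]."""
    big = p1 * p2
    x = (r1 + p1 * ((r2 - r1) * pow(p1, -1, p2) % p2)) % big
    return x - big if x > big // 2 else x


def mul_ntt_unfriendly(a, s, q, p1=7681, p2=10753):
    """Multiply a (coefficients mod q) by small s in Z_q[x]/(x^n + 1)."""
    # Center a so the integer product stays small in absolute value.
    a_c = [x - q if x >= q // 2 else x for x in a]
    c1 = negacyclic_mul_mod_p(a_c, s, p1)
    c2 = negacyclic_mul_mod_p(a_c, s, p2)
    # The CRT recovers the exact integer coefficient, which is then reduced mod q.
    return [crt_centered(r1, r2, p1, p2) % q for r1, r2 in zip(c1, c2)]


def schoolbook_negacyclic(a, b, q):
    """Reference O(n^2) multiplication in Z_q[x]/(x^n + 1)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - n] -= a[i] * b[j]  # x^n = -1
    return [x % q for x in c]
```

With n = 256, q = 2^13 and coefficients of s in [-5, 5], every integer coefficient of the product is at most 256 · 4096 · 5 = 5,242,880 in absolute value, well below 7681 · 10753 / 2, so the CRT step recovers it exactly. Calling `mul_ntt_unfriendly(a, s, 8192)` should then agree with `schoolbook_negacyclic(a, s, 8192)`.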
