NTT Multiplication for NTT-unfriendly Rings

In this paper, we show how multiplication for the polynomial rings used in the NIST PQC finalists Saber and NTRU can be efficiently implemented using the number-theoretic transform (NTT). We obtain superior performance compared to the previous state-of-the-art implementations using Toom–Cook multiplication on both of NIST's primary software optimization targets, AVX2 and Cortex-M4. Interestingly, these two platforms require different approaches: on the Cortex-M4, we use 32-bit NTT-based polynomial multiplication, while on Intel we use two 16-bit NTT-based polynomial multiplications and combine the products using the Chinese Remainder Theorem (CRT). For Saber, the performance gain is particularly pronounced: on the Cortex-M4, the Saber NTT-based matrix-vector multiplication is 61% faster than the Toom–Cook multiplication, resulting in 22% fewer cycles for Saber encapsulation. For NTRU, the speed-up is less impressive, but NTT-based multiplication still performs better than Toom–Cook for all parameter sets on the Cortex-M4. The NTT-based polynomial multiplication for NTRU-HRSS is 10% faster than Toom–Cook, which results in a 6% cost reduction for encapsulation. On AVX2, we obtain speed-ups for three out of four NTRU parameter sets. As a further illustration, we also include code for AVX2 and Cortex-M4 for the Chinese Association for Cryptologic Research competition award winner LAC (also a NIST round-2 candidate) which outperforms existing code.


Introduction
Popular public-key cryptography (PKC) primitives like RSA and elliptic-curve cryptography (ECC), which are based on the hardness of factoring large integers and of computing discrete logarithms, are both vulnerable to quantum-computer attacks [Sho94]. Thus, if large quantum computers arrive, they will break essentially all PKC in common use today.

Saber
Saber [DKRV19] is a lattice-based key encapsulation mechanism (KEM) based on the Module Learning With Rounding (M-LWR) problem. The polynomial ring used within Saber is R_q = Z_q[x]/(x^n + 1) with q = 2^13 and n = 256 across all parameter sets. Like most other lattice-based schemes, Saber constructs a CCA-secure KEM from a CPA-secure encryption scheme.

Algorithm 1 Saber Key Generation
Output: pk = (seed_A, b), sk = (s)

Algorithm 2 Saber CPA Encryption
Input: m, r, pk = (seed_A, b)
Output: ct = (c, b')

Algorithm 3 Saber CPA Decryption
Input: ct = (c, b'), sk = (s)
Output: m

Algorithm 1, Algorithm 2, and Algorithm 3 depict the CPA-secure key generation, encryption, and decryption, respectively. Sample_U refers to sampling from a uniform distribution; Sample_B refers to sampling from a binomial distribution. Expand expands a seed to a uniform matrix of polynomials. We omit the CCA variants for brevity and refer the reader to the specification for the corresponding CCA transformation. Saber's most time-consuming operation in key generation and encryption is the matrix-vector multiplication of polynomials A^T · s and A · s'. In decryption, the most expensive operation is the inner product b'^T · s.
Parameters The Saber submission specifies the three parameter sets Lightsaber, Saber, and Firesaber, targeting the NIST security levels 1, 3, and 5, respectively. While the underlying polynomial ring remains the same for all parameter sets, the module dimension l, the rounding parameter T, and the secret distribution parameter µ vary per parameter set. The parameters are summarized in Table 1a.
CCA Transform To achieve IND-CCA2 security, Saber uses a variant of the Fujisaki-Okamoto (FO) transform due to Hofheinz, Hövelmanns, and Kiltz [HHK17]. As the randomness r (and the corresponding s') cannot be recovered in decryption, Saber requires re-encryption in the decapsulation algorithm. Hence, improving the encryption also improves decapsulation. For technical details on the FO transform, refer to the specification [DKRV19].
NTRU

The algorithms for key generation, encryption, and decryption are shown in Algorithm 4, Algorithm 5, and Algorithm 6, respectively. For the details of Sample and Lift, see the NTRU specification. NTRU's main benefit is its relatively cheap encapsulation, which is the fastest among the KEM finalists in the NIST competition. However, it comes with a rather costly key generation procedure, as it requires polynomial inversion. In both encryption and decryption, the major arithmetic operation is polynomial multiplication.

LAC
LAC is a lattice-based key encapsulation mechanism based on the Ring Learning with Errors (RLWE) problem. The polynomial ring used in LAC is R_q = Z_q[x]/(x^n + 1) with q = 251, and n = 512 for LAC-128 and n = 1024 for LAC-192 and LAC-256. Like most other lattice-based schemes, LAC constructs a CCA-secure KEM from a CPA-secure encryption scheme. Algorithm 7, Algorithm 8, and Algorithm 9 depict the CPA-secure key generation, encryption, and decryption, respectively. Sample_U refers to sampling from a uniform distribution; Sample_B refers to sampling from a fixed-weight ternary distribution. Expand expands a seed to a uniform matrix of polynomials. (·)_{l_v} means taking the first l_v coefficients of a polynomial as a vector. We omit the CCA variants for brevity and refer the reader to the specification for the corresponding CCA transformation. LAC's major operations are the polynomial multiplications (as, ar, br, c_1 s).

Algorithm 7 LAC Key Generation
Output: pk = (seed_a, b), sk = (s)

Parameters The LAC submission specifies the three parameter sets LAC-128, LAC-192, and LAC-256, targeting the NIST security levels 1, 3, and 5, respectively. The parameters are summarized in Table 2.

FFT-based Polynomial Multiplications and NTT
In NTRU, LAC, and Saber, we need to multiply in the polynomial rings introduced above; in Saber, we actually need more: a matrix-vector product and an inner product built on that ring multiplication. We describe below tools to construct those multiplications. Using Fast Fourier Transforms for multiplication is a common technique. Let R be the base ring. The basic idea is the Chinese Remainder Theorem: if f, g ∈ R[x] are co-prime, then we have the ring isomorphism φ : R[x]/(fg) → R[x]/(f) × R[x]/(g), h → (h mod f, h mod g). In short, the FFT multiplication trick is that h_1 h_2 mod (x^{2n} − a^2) can be computed from (h_1 mod (x^n − a))(h_2 mod (x^n − a)) and (h_1 mod (x^n + a))(h_2 mod (x^n + a)). If we consider this an "in-place" operation with a size-2n array of elements of R representing an element of R[x]/(x^{2n} − a^2), and the bottom and top half of that array representing the element of R[x]/(x^n − a) and of R[x]/(x^n + a) respectively, then with a little change of notation we see the standard "butterfly" transformations (Figure 1).
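As a concrete toy instance of this trick, the following Python sketch splits a product mod x^8 − a^2 into the two half-size rings and recombines via the CRT. The parameters q = 17, n = 4, a = 2 are illustrative only and not taken from any of the schemes:

```python
q = 17          # a small prime base ring Z_q (illustrative choice)
n, a = 4, 2     # we work modulo x^(2n) - a^2 = x^8 - 4

def polymul(f, g):  # plain product in Z_q[x]
    h = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] = (h[i + j] + fi * gj) % q
    return h

def reduce_mod(f, m, c):  # reduce mod x^m - c, i.e. substitute x^m -> c
    r = f[:]
    for i in range(len(r) - 1, m - 1, -1):
        r[i - m] = (r[i - m] + c * r[i]) % q
        r[i] = 0
    return r[:m]

f = [1, 2, 3, 4, 5, 6, 7, 8]
g = [8, 7, 6, 5, 4, 3, 2, 1]

# CRT split: x^8 - 4 = (x^4 - 2)(x^4 + 2)
fm, fp = reduce_mod(f, n, a), reduce_mod(f, n, -a)
gm, gp = reduce_mod(g, n, a), reduce_mod(g, n, -a)
hm = reduce_mod(polymul(fm, gm), n, a)
hp = reduce_mod(polymul(fp, gp), n, -a)

# recombine: low = (hm + hp)/2, high = (hm - hp)/(2a)
inv2, inv2a = pow(2, -1, q), pow(2 * a, -1, q)
low = [(x + y) * inv2 % q for x, y in zip(hm, hp)]
high = [(x - y) * inv2a % q for x, y in zip(hm, hp)]
h = low + high   # low + x^n * high

assert h == reduce_mod(polymul(f, g), 2 * n, a * a)
```

The recombination step is exactly the inverse butterfly: it needs 2 and 2a to be invertible in R, which is the source of the final division by a power of 2 discussed below.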
To multiply h_1 h_2 with deg h_1, deg h_2 < n in an initial ring that is not an easy quotient ring for NTTs, such as in NTRU, we would first consider everything as a polynomial multiplication in R[x], then in R[x]/(x^{2n'} − 1), where n' ≥ n is a convenient size for NTTs.
The "layer 0" in the FFT is then trivial. Interested readers may refer to [Ber] for various tricks of polynomial multiplications.

The "Standard" Cooley-Tukey (CT) NTT-Based Multiplication
"The FFT" usually means splitting everything down from x^{2^k} − 1 to linear factors, which requires a primitive root ζ ∈ R of order 2^k (an element such that ζ^{2^{k−1}} = −1). The Number Theoretic Transform (NTT) means an FFT in a prime field, which must hence be modulo an "NTT-friendly" prime of the form p = 2^k p' + 1.
One might invert the FFT trick using Gentleman-Sande butterflies, replacing all the divisions by 2 with a final division by 2^k. We may also treat this as a forward FFT map (again followed by a final division by 2^k), implementing it with Cooley-Tukey butterflies. This frequently requires extra data movement on larger platforms, but can often be convenient on a small microcontroller like the ARM Cortex-M4.
Note that in a standard Fourier analysis situation, one needs to map with the powers of ζ appearing in order, whereas doing layers of Cooley-Tukey butterfly operations leaves the results in bit-reversed order. But the order is not important when all we care about is to multiply two polynomials. Schemes that are specifically designed to use the NTT like Kyber, NewHope, and Dilithium usually build the bit-reversed order into their specifications [ABD + 19, LDK + 19, PAA + 19].
If we define the bit reversal brv_n(sum_{i=0}^{n−1} a_i 2^i) := sum_{i=0}^{n−1} a_i 2^{n−1−i}, then the j-th (of 2^i) multiplier used in a standard NTT based on CT butterflies in layer i of a k-layer NTT, for 2^{k−1−i} butterflies starting at index j · 2^{k−i}, on entries 2^{k−1−i} apart, is ζ^(brv_{k−1}(j)).
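This layer/twiddle indexing can be checked with a small Python model. The parameters are toy choices (q = 17, a size-8 cyclic NTT with ζ = 9 of order 8), and the names `brv`, `ntt`, `intt` are ours, not taken from any reference code:

```python
q, k = 17, 3
n = 1 << k            # cyclic NTT of size 8, modulo x^8 - 1
zeta = 9              # element of order 8 in Z_17 (9^4 = -1 mod 17)

def brv(x, bits):     # bit reversal of a 'bits'-bit integer
    return int(format(x, f'0{bits}b')[::-1], 2)

def ntt(f):           # CT butterflies, output in bit-reversed order
    f = f[:]
    for i in range(k):                       # layer i has 2^i groups
        half = 1 << (k - 1 - i)
        for j in range(1 << i):              # group j starts at j*2^(k-i)
            w = pow(zeta, brv(j, k - 1), q)  # twiddle zeta^brv_{k-1}(j)
            base = j << (k - i)
            for t in range(base, base + half):
                u, v = f[t], w * f[t + half] % q
                f[t], f[t + half] = (u + v) % q, (u - v) % q
    return f

def intt(f):          # inverse via GS butterflies, then scale by n^-1
    f = f[:]
    for i in reversed(range(k)):
        half = 1 << (k - 1 - i)
        for j in range(1 << i):
            winv = pow(zeta, -brv(j, k - 1), q)
            base = j << (k - i)
            for t in range(base, base + half):
                u, v = f[t], f[t + half]
                f[t], f[t + half] = (u + v) % q, winv * (u - v) % q
    ninv = pow(n, -1, q)
    return [x * ninv % q for x in f]

f, g = [1, 2, 3, 4, 5, 6, 7, 0], [3, 1, 4, 1, 5, 9, 2, 6]
h = intt([a * b % q for a, b in zip(ntt(f), ntt(g))])

def cyclic(f, g):     # schoolbook convolution mod x^8 - 1, for reference
    h = [0] * n
    for i in range(n):
        for j in range(n):
            h[(i + j) % n] = (h[(i + j) % n] + f[i] * g[j]) % q
    return h

assert h == cyclic(f, g)
```

Note that the pointwise products are taken in bit-reversed order on both operands, which is exactly why the order is irrelevant for multiplication.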

The "Twisted" Gentleman-Sande (GS) NTT-Based Multiplication
"The Twisted FFT trick" means to do the FFT trick on R[X]/(X^{2n} − 1) to split it down to the rings R[X]/(X^n ∓ 1), then map R[X]/(X^n + 1) to R[Y]/(Y^n − 1), where Y = ζX with ζ the primitive root of order 2n (so ζ^n = −1). When n = 2^{k−1} and our prime is NTT-friendly, we can split all the way down to 2n copies of R[x]/(x − 1). One can verify that the result is componentwise the same as the more familiar form above.
The butterflies appearing in such a forward NTT would all be of the GS variety. The j-th (of 2^{k−1−i}) multiplier used in layer i of a k-layer NTT, for butterflies on 2^i pairs of entries spaced 2^{k−1−i} apart, starting at index j and incrementing by 2^{k−i}, is ζ^(j·2^i).
Note that doing backward NTTs using CT butterflies as in Sec. 2.4.1 is properly considered the reverse of the Gentleman-Sande NTT and is often used to avoid reductions.

NTT-based Negacyclic Convolutions
The "negacyclic convolution" means multiplication modulo x^n + 1. When n = 2^{k−1} and the ring R contains a primitive root ζ of order 2n, the full negacyclic NTT is just the top half of a full FFT mod (x^{2n} − 1). I.e., the j-th (of 2^i) multiplier encountered in layer i, for 2^{k−1−i} CT butterflies starting at index j · 2^{k−i}, on entries 2^{k−1−i} apart, is ζ^(brv_{k−1}(j + 2^i)).
Here, starting by "twisting" into mod (x^n − 1) instead would cost an extra layer of multiplications.

Incomplete NTTs
If we proceed with the FFT trick for some number of layers on both multiplicands (coming down to polynomials modulo factors x^h − ζ_i), do the pairwise modular multiplications using schoolbook multiplication on h coefficients each, then invert the FFT trick, we have performed a multiplication by an incomplete NTT.
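A minimal Python sketch of such an incomplete-NTT multiplication, again on a toy size-8 cyclic ring over Z_17: we stop after 2 of the 3 possible layers and finish with 2-coefficient schoolbook multiplications modulo x^2 − c. All names and parameters here are illustrative:

```python
q, k = 17, 3
n = 1 << k              # multiply modulo x^8 - 1 in Z_17
zeta = 9                # element of order 8 in Z_17
layers = 2              # stop one layer early: base polynomials of degree < 2

def brv(x, bits):
    return int(format(x, f'0{bits}b')[::-1], 2)

def fwd(f):             # 'layers' levels of CT butterflies
    f = f[:]
    for i in range(layers):
        half = 1 << (k - 1 - i)
        for j in range(1 << i):
            w = pow(zeta, brv(j, k - 1), q)
            base = j << (k - i)
            for t in range(base, base + half):
                u, v = f[t], w * f[t + half] % q
                f[t], f[t + half] = (u + v) % q, (u - v) % q
    return f

def inv(f):             # undo the layers with GS butterflies, scale by 2^-layers
    f = f[:]
    for i in reversed(range(layers)):
        half = 1 << (k - 1 - i)
        for j in range(1 << i):
            winv = pow(zeta, -brv(j, k - 1), q)
            base = j << (k - i)
            for t in range(base, base + half):
                u, v = f[t], f[t + half]
                f[t], f[t + half] = (u + v) % q, winv * (u - v) % q
    s = pow(1 << layers, -1, q)
    return [x * s % q for x in f]

def base_mul(fh, gh):   # schoolbook in each 2-coefficient block mod x^2 - c
    hh = [0] * n
    for b in range(n >> 1):
        c = pow(zeta, 2 * brv(b, k - 1), q)  # block b holds a residue mod x^2 - c
        a0, a1, b0, b1 = fh[2*b], fh[2*b+1], gh[2*b], gh[2*b+1]
        hh[2*b] = (a0*b0 + c*a1*b1) % q
        hh[2*b + 1] = (a0*b1 + a1*b0) % q
    return hh

f, g = [1, 2, 3, 4, 5, 6, 7, 0], [3, 1, 4, 1, 5, 9, 2, 6]
h = inv(base_mul(fwd(f), fwd(g)))

ref = [0] * n           # schoolbook convolution mod x^8 - 1, for reference
for i in range(n):
    for j in range(n):
        ref[(i + j) % n] = (ref[(i + j) % n] + f[i] * g[j]) % q
assert h == ref
```

The constant c of each base block is the residue of x^2 left unsplit by the skipped layer, which is why the base multiplication must fold a_1 b_1 back with a factor c.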

Good's Trick and NTTs
Instead of the incomplete NTT, if the length of the NTT is n = h · 2^k with h odd, we can apply Good's FFT trick [Goo51], where we set x = yw with y^{2^{k−1}} = −1 = w^{h−1} + w^{h−2} + · · · + w.
This identifies R[x]/(x^n − 1) with (R[w]/(w^h − 1))[y]/(y^{2^k} − 1). We term "Good's permutation" the map between the corresponding coefficient arrays. We follow Good's permutation with a size-2^k FFT w.r.t. y on each multiplicand, represented by h parallel size-2^k NTTs. Then we do "point" multiplication by convolving together degree-(h − 1) polynomials in w modulo w^h − 1, do an inverse size-2^k FFT (represented by h inverse NTTs), and then finally undo Good's permutation.
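Good's permutation itself is just index arithmetic: the coefficient of x^i goes to position (i mod 2^k, i mod h), and since gcd(2^k, h) = 1 this is a bijection that turns the length-n cyclic convolution into a 2-D cyclic convolution. A small Python check with the toy parameters n = 12 = 3 · 4 (a brute-force 2-D convolution stands in for the parallel NTTs):

```python
h_dim, two_k = 3, 4        # n = h * 2^k with h odd, gcd(3, 4) = 1
n, q = h_dim * two_k, 17

def good(f):               # coefficient of x^i lands at (i mod 2^k, i mod h)
    M = [[0] * h_dim for _ in range(two_k)]
    for i, c in enumerate(f):
        M[i % two_k][i % h_dim] = c
    return M

def good_inv(M):           # CRT on exponents recovers i from the index pair
    return [M[i % two_k][i % h_dim] for i in range(n)]

import random
random.seed(1)
f = [random.randrange(q) for _ in range(n)]
g = [random.randrange(q) for _ in range(n)]

ref = [0] * n              # 1-D cyclic convolution modulo x^12 - 1
for i in range(n):
    for j in range(n):
        ref[(i + j) % n] = (ref[(i + j) % n] + f[i] * g[j]) % q

# 2-D cyclic convolution modulo (y^4 - 1, w^3 - 1) after Good's permutation
F, G = good(f), good(g)
H = [[0] * h_dim for _ in range(two_k)]
for a in range(two_k):
    for b in range(h_dim):
        for c in range(two_k):
            for d in range(h_dim):
                H[(a + c) % two_k][(b + d) % h_dim] = \
                    (H[(a + c) % two_k][(b + d) % h_dim] + F[a][b] * G[c][d]) % q

assert good_inv(H) == ref
```

In the real implementation the y-direction of the 2-D convolution is computed by the h parallel size-2^k NTTs, and the w-direction by the small convolutions of degree-(h − 1) polynomials.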

NTTs with Modulus not of the form 2 k p + 1
Suppose we have a convolution modulo x^n − 1 modulo q, where n ∤ (q − 1). We can express the polynomials with coefficients in [−q/2, q/2) and compute the convolution as a polynomial with integer coefficients. The absolute value of the resulting coefficients is at most nq^2/4. Therefore, if we find a prime p > nq^2/2 such that n | (p − 1), and compute the multiplication mod p (which we can, using NTTs of length n mod p), then the result must be correct as a polynomial with integer coefficients, and we can then recover our correct result modulo q.
The procedure is quite similar for a different kind of convolution or another product. In our applications (Saber and NTRU), one of the multiplicands is usually "small", so that we can use an even smaller prime p.
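The bound can be illustrated with a Python sketch using toy parameters q = 8 (a power of two, so not NTT-friendly), n = 8, and the prime p = 257 > nq^2/2 with n | p − 1. A brute-force convolution stands in for the length-n NTT mod p:

```python
q, n, p = 8, 8, 257     # q = 2^3 is NTT-unfriendly; p > n*q^2/2 = 256, 8 | p-1

def center(a, m):       # representative in [-m/2, m/2)
    a %= m
    return a - m if a >= m // 2 else a

f = [center(a, q) for a in [1, 7, 3, 5, 0, 2, 6, 4]]
g = [center(a, q) for a in [5, 1, 0, 7, 2, 2, 3, 1]]

def cyclic(f, g, m):    # convolution mod (x^n - 1, m); an NTT of length n
    h = [0] * n         # mod p would compute exactly the same values
    for i in range(n):
        for j in range(n):
            h[(i + j) % n] = (h[(i + j) % n] + f[i] * g[j]) % m
    return h

# over Z the coefficients are bounded by n*(q/2)^2 = 128 <= p/2, so the
# mod-p result, centered, is exact over Z; reduce it mod q at the end
h_p = [center(c, p) for c in cyclic(f, g, p)]
assert all(abs(c) <= n * (q // 2) ** 2 for c in h_p)
assert [c % q for c in h_p] == cyclic(f, g, q)
```

The same sketch works for negacyclic or other convolutions by changing the index wrap-around, and a smaller p suffices when one multiplicand has small (e.g. ternary) coefficients.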

Mixed-Radix NTT for Multiplications
In [CT65], Cooley and Tukey explained how to efficiently compute a general FFT of composite size N. In such cases, the FFT can be realized by combining the results of N/p smaller FFTs on vectors of size p. These elementary FFT operations on vectors of prime size are also referred to, after the shape of their diagrams, as butterflies. After such subdivisions, the immediate output of the algorithm appears in an order different from that of the input (but, as in the radix-2 case, that need not concern us).

Multiple Moduli and the Explicit CRT (Divided Difference Form)
As in Section 2.4.6, suppose we have a convolution modulo x^n − 1 modulo q, where n ∤ (q − 1). A different possibility is to take several NTT-friendly primes p_i whose product P is sufficiently large (usually > nq^2/2). Clearly, computing the multiplication mod P returns the correct product as a polynomial with integer coefficients; this we can do by computing the product modulo each p_i using NTTs. There are at least two methods to put the pieces together modulo P, from which we can compute our correct results. One is the Explicit Chinese Remainder Theorem [BS07]. The other is a divided-difference form of the CRT, which also holds for non-centered mods and is faster than [BS07] when the number of moduli is small.
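A Python sketch of the two-prime variant with toy parameters q = 8, n = 4, p_1 = 17, p_2 = 13. The recombination below is the divided-difference form x = x_1 + p_1 · ((x_2 − x_1) p_1^{−1} mod± p_2):

```python
q, n = 8, 4
p1, p2 = 17, 13         # NTT-friendly toy primes: n | p1-1 and n | p2-1
P = p1 * p2             # 221 > n*q^2/2 = 128, large enough

def center(a, m):       # representative in [-m/2, m/2)
    a %= m
    return a - m if a >= m // 2 else a

f = [center(a, q) for a in [1, 7, 3, 5]]
g = [center(a, q) for a in [5, 1, 0, 7]]

def cyclic(f, g, m):    # stands in for an NTT-based product mod m
    h = [0] * n
    for i in range(n):
        for j in range(n):
            h[(i + j) % n] = (h[(i + j) % n] + f[i] * g[j]) % m
    return h

h1, h2 = cyclic(f, g, p1), cyclic(f, g, p2)

# divided-difference CRT: x = x1 + p1*((x2 - x1)*p1^-1 mod+- p2), centered
p1inv = pow(p1, -1, p2)
h = [center(x1 + p1 * center((x2 - x1) * p1inv, p2), P)
     for x1, x2 in zip(h1, h2)]

assert [c % q for c in h] == cyclic(f, g, q)
```

Since the centered result mod P equals the product over Z (the coefficients are below P/2), reducing it mod q yields the desired convolution.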

Cortex-M4
As selected by NIST for evaluating PQC candidates on microcontrollers, the ARM Cortex-M4 (or rather M4F, since a floating-point unit is assumed) is one of our target platforms for implementing PQC schemes. It is a RISC microcontroller with fourteen 32-bit general-purpose registers. Its instruction set has several unusual features:

Single Cycle: Most instructions take 1 cycle each, including the 32 × 32 + 64 → 64-bit multiply-accumulate. The most conspicuous exception is the first load in a sequence of loads (2 cycles).

Barrel Shifter:
In almost all instructions, one of the operands can be shifted or rotated by an arbitrary number of bits at no extra cost (and this can sometimes even help set the carry flag).
Flexible Indexing: A register may be incremented before or after being used as an address for a load. The sum of two registers, one possibly shifted, may be used as the address.
SIMD: Some arithmetic instructions operate on 8- and 16-bit chunks of registers.
Restricted immediates: Not all 32-bit numbers can be used as an immediate operand.
Optional Flag-setting: Instructions don't set flags by default, though most optionally do.
We mention some of the main tricks which we employed below.
Reductions. Since all of our approaches on the Cortex-M4 are 32-bit NTTs, we need 32-bit modular reductions. We implement 32-bit signed Barrett reduction and 32-bit signed Montgomery multiplication with the Cortex-M4's powerful 1-cycle long multiplications smull, smlal, and smmulr. Fix an integer a, a modulus q, and let R = 2^32. For Barrett reduction, we compute a − ⌊a/q⌉ · q, approximating ⌊a/q⌉ as ⌊a · ⌊R/q⌉ / R⌉, in 2 cycles as shown in Algorithm 10.
For Montgomery multiplication, we compute abR^{−1} mod± q (cf. Algorithm 11). When b is a known constant, we may precompute bR mod± q and derive ab mod± q instead. While computing the point multiplication for NTTs using schoolbook multiplication, neither multiplicand is known beforehand, but we may cancel out the R^{−1} mod± q by Montgomery-multiplying by the precomputed N^{−1}R^2 mod± q (i.e., carrying an extra factor of R) at the end of NTT^{−1}, where N^{−1} is the scaling factor of the inverse NTT. For convenience, we denote the computation (a, b) → abR^{−1} mod± q by montgomeryM and a (64-bit) → aR^{−1} mod± q (32-bit) by montgomeryR.
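The two reductions can be modeled in Python as follows. The `smmulr` helper mimics the rounding high-multiply instruction, the constants use one of the Saber NTT primes from later in this paper, and the function names are ours:

```python
R = 1 << 32
q = 25166081                      # one of the Saber NTT primes (196610*128 + 1)

def smmulr(a, b):                 # signed most-significant-word multiply with
    return (a * b + (1 << 31)) >> 32   # rounding: round(a*b / 2^32)

def barrett(a):                   # a - round(a * round(R/q) / R) * q
    return a - smmulr(a, (R + q // 2) // q) * q

QINV = pow(q, -1, R)              # precomputed q^-1 mod 2^32

def montgomery(a, b):             # montgomeryM: a*b*R^-1 mod+- q
    t = (a * b * QINV) % R        # low product, then center to signed range
    if t >= R // 2:
        t -= R
    return (a * b - t * q) >> 32  # exact division by R

for a in (-2**31, -12345678, 0, 98765, 2**31 - 1):
    r = barrett(a)
    assert (r - a) % q == 0 and abs(r) <= q
    m = montgomery(a, 1234)
    assert (m * R - a * 1234) % q == 0 and abs(m) < q
```

On the Cortex-M4, the low product, the high product, and the subtraction in `montgomery` map to smull/smlal-style instructions; the Python model only checks the arithmetic.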

Algorithm 10 Barrett reduction
Algorithm 11 Montgomery multiplication

Floating-point registers. On the Cortex-M4 as specified by NIST, there are 14 general-purpose registers and 32 floating-point registers. The floating-point registers not only let us keep frequently used twiddle factors close at hand, but also give us great flexibility in designing our approaches: loading twiddle factors into floating-point registers before the butterfly loops saves a general-purpose register (crucial, since all our implementations on the Cortex-M4 are 32-bit NTTs). This gives us a slightly faster implementation for ntruhps2048509 compared to the Toom-4 implementation, and a faster variant that uses 64-bit (vs. the usual 32-bit) accumulators for Saber's matrix-vector product. Details for these variants are elaborated in the relevant sections.

NTTs on the Cortex-M4
On the Cortex-M4, we commonly compute three layers of radix-2 NTTs at a time; Algorithm 19 illustrates the idea and is adapted from [ACC+20].

Saber
For Saber, we replace the polynomial multiplications in the subroutines InnerProd and MatrixVectorMul using the negacyclic NTT trick, eliminating all Toom-4 multiplications in Saber. In the interest of brevity, we only detail MatrixVectorMul (which takes most of the time), which multiplies an l × l matrix by an l × 1 vector, where each component is an element of R_q. The design of Saber provides additional incentives to use NTTs, because the matrix-vector product is turned into a matrix-vector point-multiplication in the NTT domain. More concretely, we do not merely save the difference in cycles between Toom-4 and NTT-based degree-255 polynomial multiplications: to compute the l^2 multiplications in MatrixVectorMul, we only need to compute l^2 + l NTTs and l inverse NTTs, instead of the 2l^2 NTTs and l^2 inverse NTTs that might normally be expected. Our NTT-based MatrixVectorMul therefore proceeds as follows: compute the size-256 negacyclic NTT for each component of the matrix and the vector, multiply the matrix by the vector with degree-3 schoolbook multiplications, accumulate the results into a vector, and then compute the inverse NTT of each component.
We compute incomplete NTTs and degree-3 schoolbook multiplications, as this gives the best performance for Saber. To compute a_0 · b_0 + a_1 · b_1 + a_2 · b_2 (mod q) using Montgomery reductions, we only need 1 smull, 2 smlal, and 1 Montgomery reduction, instead of computing 3 multiplications, each followed by a Montgomery reduction, adding the results together, and then reducing modulo q again. Furthermore, this idea also applies where each a_i · b_i is a degree-3 schoolbook multiplication.
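The following Python sketch models the transform-count saving on a toy scale: a size-4 negacyclic NTT over Z_17 implemented by direct evaluation/interpolation rather than butterflies, with arbitrary illustrative matrix entries. It performs l^2 + l forward transforms and only l inverse transforms:

```python
q, n = 17, 4
zeta = 9                                  # order 8 in Z_17; x^4 + 1 splits fully
roots = [pow(zeta, e, q) for e in (1, 3, 5, 7)]   # primitive 8th roots of unity

def ntt(f):                               # evaluation at the roots of x^4 + 1
    return [sum(c * pow(w, i, q) for i, c in enumerate(f)) % q for w in roots]

def intt(v):                              # interpolation back to coefficients
    ninv = pow(n, -1, q)
    return [ninv * sum(vt * pow(w, -i, q) for vt, w in zip(v, roots)) % q
            for i in range(n)]

def negacyclic(f, g):                     # reference product mod x^4 + 1
    h = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1
            h[(i + j) % n] = (h[(i + j) % n] + sign * f[i] * g[j]) % q
    return h

l = 2
A = [[[1, 2, 3, 4], [5, 6, 7, 0]], [[2, 0, 1, 3], [4, 4, 1, 2]]]
s = [[1, 0, 16, 2], [3, 1, 0, 5]]

A_hat = [[ntt(p) for p in row] for row in A]      # l^2 forward NTTs
s_hat = [ntt(p) for p in s]                       # l   forward NTTs
b = []
for row in A_hat:                                 # pointwise multiply-accumulate
    acc = [0] * n                                 # in the NTT domain, then only
    for p_hat, v_hat in zip(row, s_hat):          # l inverse NTTs
        acc = [(a + x * y) % q for a, x, y in zip(acc, p_hat, v_hat)]
    b.append(intt(acc))

ref = []                                          # direct matrix-vector product
for row in A:
    acc = [0] * n
    for p, v in zip(row, s):
        acc = [(a + c) % q for a, c in zip(acc, negacyclic(p, v))]
    ref.append(acc)
assert b == ref
```

The accumulation happens on NTT-domain values, which is exactly what lets the l^2 inverse transforms collapse into l.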
Choosing the best incomplete NTT. When using incomplete NTTs, we need to choose the point at which we stop doing NTT butterfly operations and simply multiply the remaining polynomials using schoolbook multiplication. One can choose between 8 layers of NTTs, 7 layers of NTTs followed by 2 × 2 schoolbook, 6 layers of NTTs followed by 4 × 4 schoolbook, and 5 layers of NTTs followed by 8 × 8 schoolbook. First we compare the behavior of the incomplete NTTs themselves. On a Cortex-M4, among the 14 general-purpose registers, we need one register for loading coefficients, one register for loading the twiddle factors ζ, two registers for constants used in Montgomery multiplication, and two registers as temporary storage for Montgomery multiplication in the schoolbook part. This leaves only 8 registers, so computing 3 layers of NTTs at a time can be achieved without overhead. Computing 5 layers at a time does not hurt the economical use of registers much, since we can often compute the 5th layer without spilling registers; computing 7 layers at a time, however, would involve a lot of vmovs because of the lack of registers. For Saber, we achieve the best performance with 6 layers of NTTs. This can be explained by comparing 4 × 4 schoolbook multiplication and size-4 NTTs. For simplicity, we focus on an l-dimensional matrix-vector product in which each component is a degree-3 polynomial. A 4 × 4 schoolbook multiplication requires 7 smulls, 12 smlals, and 7 montgomeryRs, as illustrated in Algorithm 17 in the Appendix. For accumulation, each l-dimensional row-column inner product requires 4l − 4 adds and 4l − 4 vmovs for temporary storage. Therefore, 41l^2 − 8l cycles are required for the 4 × 4 schoolbook approach. To use the size-4 NTT trick instead, we calculate the size-4 NTT of each component, multiply component by component with point-multiplications, accumulate into a vector, and finally compute the size-4 inverse NTT of each component.
Each size-4 NTT requires 4 montgomeryMs and 4 add-sub pairs, as shown in Algorithm 16. Since only 14 registers are available, we need to vmov 4 · (2 · (l − 1) + l) · l = 12l^2 − 8l times for storing intermediate values for accumulation. If the NTT trick is adopted, the matrix-vector product requires 20l^2 + 40l cycles for the l^2 + l NTTs and l inverse NTTs, plus l^2 montgomeryMs, 12l^2 − 8l vmovs, and 4l^2 − 4l adds, resulting in 39l^2 + 28l cycles. We have 41l^2 − 8l < 39l^2 + 28l exactly for l < 18.
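The crossover claimed above is easy to check numerically; the two polynomials in l are taken directly from the cycle counts derived in the text:

```python
# cycle counts from the text, as functions of the module dimension l
school = lambda l: 41 * l * l - 8 * l    # 4x4 schoolbook accumulation
ntt4 = lambda l: 39 * l * l + 28 * l     # size-4 NTT trick

# schoolbook is cheaper exactly while 2*l^2 < 36*l, i.e. for l < 18
assert all(school(l) < ntt4(l) for l in range(1, 18))
assert all(school(l) >= ntt4(l) for l in range(18, 100))
```

Since Saber's module dimensions are l ∈ {2, 3, 4}, the schoolbook approach wins comfortably.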

Better Accumulation For Schoolbook Multiplication.
There is an even better approach to the matrix-vector product, utilizing the commutativity of the operations. All adds and some montgomeryRs can be removed at the cost of some additional vmovs: we accumulate each coefficient as a 64-bit value and only then reduce it to 32 bits with a single montgomeryR, whereby all the adds can be absorbed (changing some smulls into smlals). Algorithm 18 illustrates the idea. To summarize, we save 4l^2 − 4l adds and 4l^2 − 4l montgomeryRs (each of which takes 2 cycles) at the cost of 8l^2 − 8l vmovs, and the cycle count therefore becomes 37l^2 − 4l, smaller than 39l^2 + 28l for all l. We find the above approach hard to beat regardless of l.
Our Optimized Negacyclic NTT Trick. Since we compute incomplete size-256 negacyclic NTTs, we choose the prime q' = 25166081 = 196610 · 128 + 1 for Saber and Firesaber, and the prime q' = 20972417 = 163847 · 128 + 1 for Lightsaber. We compute the NTTs with six layers of radix-2 CT butterflies, where the first three layers are merged and the following three layers are merged, then compute the schoolbook-and-accumulate step with the above strategy, and finally compute the incomplete size-256 negacyclic inverse NTTs using GS butterflies with the same 3-layer merging.
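Both primes can be sanity-checked in a few lines of Python; trial-division primality testing is fine at this size, and the factorizations of q' − 1 are the ones given above. The condition 128 | q' − 1 is what guarantees a root of order 128, as needed for the incomplete size-256 negacyclic NTT stopping at 4-coefficient factors:

```python
def is_prime(m):
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

# (prime, cofactor) pairs for Saber/Firesaber and Lightsaber respectively
for q, k in [(25166081, 196610), (20972417, 163847)]:
    assert q == k * 128 + 1 and is_prime(q)
    # Z_q* is cyclic of order q-1, so 128 | q-1 gives an element of order 128
    assert (q - 1) % 128 == 0
```
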

NTRU
In this section, we go into implementation details for polynomial multiplication in NTRU on Cortex-M4. We are targeting the first two poly_Rq_muls and the first poly_Sq_mul in key generation, the poly_Rq_mul in encryption, and the first poly_Rq_mul in decryption. While implementing polynomial multiplication for each parameter set, we optimized the code in various aspects. Some ideas work for all parameter sets, and some are only suitable for a particular one. The core ideas are simple: manipulate registers wisely, compute small convolutions with schoolbook, and change the domain only when needed. We summarize the tricks used for each parameter set in Table 3.

Layers of NTT.
As usual, several layers of NTTs are computed at a time to avoid load-stores and use the registers economically. On the Cortex-M4, since only 14 general-purpose registers are available, we compute three layers of radix-2 NTTs (and two layers of radix-3 NTTs) at a time. For ntruhps2048509, we employ a seemingly strange alternative, computing four layers of radix-2 NTTs at a time, to set up a better foundation for the polynomial multiplication. This results in a slightly faster implementation for ntruhps2048509 compared to the Toom-4 approach.
Tricks for commutative operations. Recall that after an NTT-based multiplication we must cancel out the scaling factor N of the size-N NTT. We can halve the number of Montgomery multiplications by N^{−1}R^2 mod± q' by first reducing modulo the polynomial modulus and only then performing those multiplications. The same idea also applies to the operations of reducing the coefficients from Z_q' to Z_q and packing two coefficients into one register: because they commute, we pack two coefficients first and then and with (q − 1)‖(q − 1). Algorithm 20 shows how these ideas are implemented in the final stage.

ntruhps4096821. Algorithm 12 depicts the NTT for ntruhps4096821. We compute incomplete mixed-radix size-1728 NTTs for each polynomial by splitting down to factors x^3 − ζ_{i,j}, multiply degree-2 polynomials with schoolbook multiplication, compute the incomplete mixed-radix inverse NTTs, and then reduce the coefficient ring to Z_q. For the incomplete size-1728 NTTs, we first compute size-9 NTTs with two layers of radix-3 NTTs, for each set of 9 coefficients spaced 192 units apart. Next, for each run of 192 consecutive coefficients, we compute size-64 NTTs with six layers of radix-2 NTTs, for each set of 64 coefficients spaced 3 units apart, leaving degree-2 polynomials. Among the 9 sets of 192 coefficients, standard size-64 NTTs are computed for the first set, and twisted size-64 NTTs for the rest. The incomplete size-1728 inverse NTT is computed in the reverse manner. For the final stage, we employ all the ideas mentioned in the previous paragraph: taking the quotient before Montgomery-multiplying by N^{−1}R^2 mod q', and packing two coefficients before the and. For merging layers, the two layers of radix-3 NTTs are merged, the first three layers of radix-2 NTTs are merged, the following three layers of radix-2 NTTs are merged, and the inverse NTTs are merged in the same manner.

ntruhrss701 and ntruhps2048677. Algorithm 13 shows the NTT used for ntruhrss701 and ntruhps2048677. We use Good's trick for both.
Our approach is almost the same as [ACC+20], with a slightly faster final stage, because reduction mod 2^k and mod (x^n − 1) are cheaper. We employ Good's permutation of size 3 × 2^9 for the size-1536 NTT. The algorithm goes in the following order: compute three size-512 NTTs (CT butterflies), each on 512 contiguous entries; compute 3 × 3 convolutions, where coefficients are spaced 512 units apart; invert the size-512 NTTs (CT butterflies); and a final stage. This last stage consists of: inverting Good's permutation, taking the remainder mod (x^n − 1), Montgomery-multiplying by N^{−1}R^2 mod q', packing two coefficients into one register, and reducing to the coefficient ring Z_q. We implement the inverse NTT using CT butterflies because we then need fewer reductions to avoid overflows. As mentioned above, we take the remainder mod (x^n − 1) first, so we save half of the Montgomery multiplications by N^{−1}R^2 mod q'.
ntruhps2048509. We merge our NTT layers differently for ntruhps2048509 to provide a better framework for the polynomial multiplication; see Algorithm 14 for the details. We do 2 sets of four-layer NTTs (CT butterflies) for the incomplete size-1024 NTTs, perform each multiplication of 4-coefficient (degree-3) polynomials with schoolbook multiplication, do 2 sets of 3-layer inverse NTTs (GS butterflies), and a final stage. Here the GS butterflies make for an easier final stage comprising the following operations: 2 layers of inverse NTTs, taking the remainder mod (x^n − 1), Montgomery-multiplying by N^{−1}R^2 mod q', packing two coefficients into one register, and reducing to the coefficient ring Z_q. This approach saves 1 layer of load-stores.
NTT trick for LAC. We employ the negacyclic NTT trick on the rings Z_q'[x]/(x^512 + 1) and Z_q'[x]/(x^1024 + 1) for LAC-128, and for LAC-192 and LAC-256, respectively. Our approach for LAC-128 proceeds as follows: compute the negacyclic size-512 NTTs of the polynomials, do point-by-point multiplications, and finally compute the size-512 inverse NTT. Our approach for LAC-192 and LAC-256 proceeds as follows: compute incomplete negacyclic size-1024 NTTs by three sets of 3-layer radix-2 NTTs, compute 2 × 2 schoolbook multiplications, and invert the NTT.
On the "optimized implementation" in the LAC submission. The original LAC "optimized" code stores small polynomials as arrays of the indices of their non-zero terms (and does secret-dependent table lookups), and it uses the C % operator. These operations are not constant-time, posing a security risk. We use a standard representation of the array to do NTTs, and replace all C % operators with Barrett reductions to obtain a constant-time implementation.

Vectorized NTT on AVX2
For fast NTT-based polynomial multiplication on current x86 processors from Intel and AMD, it is necessary to use a vectorized implementation of the NTT. These processors support the AVX2 instruction set, offering a large number of instructions that operate on 16 vector registers, each 256 bits long. Vectorized AVX2 NTTs have previously been used in, e.g., implementations of Kyber, NTTRU, and Dilithium.

Fast mulmods
A first obstacle to fast vectorization of the NTT is the problem of efficiently multiplying many coefficients modulo a small prime q. The standard way to compute modular products is to first compute the double-length products over Z, and then reduce these intermediate results modulo q. In a vectorized implementation, in order to achieve the highest possible throughput, one wants to pack as many coefficients as possible into a vector register. But double-length intermediate products mean it is only possible to achieve half the density compared to packing only mod-q reduced integers. This effectively reduces the speed of the implementation by a factor of two. Note that this is not a problem when computing products modulo a power of two, as in other polynomial multiplication implementations for Saber or NTRU that directly operate over the respective polynomial rings; there, the binary arithmetic in modern CPUs automatically takes care of the modular reduction.

Algorithm 15 Multiplication modulo a 16-bit prime q
Input: a, b, and the precomputed b' = bq^{−1} mod± 2^16
Output: r ≡ ab2^{−16} (mod q)
1: t_1 ← ⌊ab/2^16⌋ (signed high product)
2: t_0 ← ab' mod± 2^16 (signed low product)
3: t_0 ← ⌊t_0 q/2^16⌋ (signed high product)
4: r ← (t_1 − t_0) mod± 2^16

To overcome this obstacle, we use the modified Montgomery reduction algorithm from [Sei18] together with the improvement from [LS19]. Here the modular multiplications are computed from separate intermediate low and high half-products. When using the AVX2 instruction set, this approach is most efficient for 16-bit primes q. The reason is that there is a specific high-only half-product instruction vpmulhw for packed 16-bit integers that has no equivalent for packed 32-bit integers. Therefore, unlike on the Cortex-M4, we use NTTs modulo 16-bit primes q on AVX2. Then we need to use a multi-modular approach and compute the polynomial products modulo two such primes, so that we are able to correctly lift the results to Z with the help of the Chinese Remainder Theorem.
The additional polynomial product modulo a second prime, involving three NTT computations and a base product computation, does not result in reduced speed: the loss of a factor of two is completely compensated for by the doubled throughput from packing 16-bit instead of 32-bit integers. Another benefit of 16-bit primes is that the occasional product modulo 3 in NTRU can be computed more efficiently, but we have not used this improvement in our experiments.
We state the modular multiplication algorithm in Algorithm 15. As inputs it gets a 16-bit integer a and a mod-q reduced integer b, together with the precomputed b' = bq^{−1} mod± 2^16. The algorithm then outputs a representative modulo q of the scaled product ab2^{−16} mod q. The second multiplicand b is always a fixed constant in the NTT, and hence b and the corresponding b' can easily be precomputed. The scaling factor 2^{−16} is handled as usual by precomputing b and b' with an additional factor of 2^16.
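Algorithm 15 can be simulated in Python as follows. The shifts model vpmulhw/vpmullw on signed 16-bit lanes, and b = 3456 is an arbitrary illustrative constant:

```python
q = 7681                          # one of the 16-bit NTT primes

def s16(x):                       # wrap to a signed 16-bit value
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x

def mulmod(a, b, bp):             # Algorithm 15: returns a*b*2^-16 mod+- q
    t1 = (a * b) >> 16            # signed high product  (vpmulhw)
    t0 = s16(a * bp)              # signed low product   (vpmullw)
    t0 = (t0 * q) >> 16           # signed high product  (vpmulhw)
    return s16(t1 - t0)

b = 3456                          # a fixed twiddle-style constant, reduced mod q
bp = s16(b * pow(q, -1, 1 << 16)) # precomputed b' = b * q^-1 mod+- 2^16

for a in (-32768, -1234, 0, 777, 32767):
    r = mulmod(a, b, bp)
    assert (r * (1 << 16) - a * b) % q == 0   # r = a*b*2^-16 (mod q)
    assert -q < r < q
```

Since ab ≡ t_0 q (mod 2^16), the two high products differ by exactly (ab − t_0 q)/2^16, so the subtraction yields a representative of ab2^{−16} mod q that fits in 16 bits.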

Choice of Transforms
We considered several different choices of transforms. For Saber, with its NTT-friendly polynomial modulus X^256 + 1, we compute negacyclic length-256 transforms modulo X^256 + 1 as we do on the Cortex-M4. For a single polynomial multiplication it is usually advantageous to use an incomplete NTT, but for Saber, where in the matrix-vector product the vector of polynomials only needs to be transformed once and the inner products can be computed in the NTT basis, a complete NTT is preferable. In the case of ntruhps2048677 and ntruhrss701, we compute an incomplete NTT modulo X^1536 − 1, where we do 9 radix-2 splittings down to factors of degree 3. Since the input polynomials have degree less than 768, the first splitting is for free. For ntruhps2048509 and ntruhps4096821, the same approach that we use on the Cortex-M4 should also give good results on Skylake; in particular, a length-1728 NTT with two radix-3 splittings, followed by 6 radix-2 splittings, down to polynomials of degree less than 3. For LAC, with its polynomial moduli X^512 + 1 and X^1024 + 1, we compute incomplete negacyclic length-512 and length-1024 NTTs, respectively, each with 8 layers, coming down to factors of degree 2 and 4.
We chose the prime moduli 7681 and 10753 for the NTTs of lengths 256, 512, 1024, and 1536. Their product is slightly larger than 2^26, which is enough for all our applications. In the case of Saber, the absolute value of the polynomial coefficients when computing the matrix-vector product over Z is bounded by 2^24, which is below 2^25. In NTRU, the maximum absolute value is attained in ntruhrss701, where the coefficients are bounded by 2^24.04 in all products of a uniform polynomial with a short polynomial. Next, as 7680 and 10752 are divisible by 1536 = 3 · 2^9, both of these moduli support complete transforms modulo X^1536 − 1, which is all that we need for Saber and the NTRU parameter sets except ntruhps4096821. For LAC, the coefficients are even smaller, so this is no problem. For implementing the length-1728 NTT that we need for the remaining NTRU parameter set ntruhps4096821, the two 16-bit primes 3457 and 8641 are used. Their product is sufficiently large, they support complete length-1728 NTTs, and they are even slightly smaller than the primes described above, which is good for modular reductions.
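The divisibility and size claims above are mechanically checkable; the following small Python sketch asserts only the bounds stated in the text:

```python
def is_prime(m):
    return m > 1 and all(m % d for d in range(2, int(m ** 0.5) + 1))

# 7681 and 10753 support length-1536 (and shorter power-of-two) NTTs
assert is_prime(7681) and is_prime(10753)
assert 7680 % 1536 == 0 and 10752 % 1536 == 0      # 1536 = 3 * 2^9
assert 7681 * 10753 > 2 ** 26                      # product exceeds 26 bits

# 3457 and 8641 support complete length-1728 NTTs (1728 = 27 * 2^6)
assert is_prime(3457) and is_prime(8641)
assert 3456 % 1728 == 0 and 8640 % 1728 == 0
```
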
So, algebraically, for Saber we compute the map

Z_q[X]/(X^256 + 1) → Z_q[X]/(X − ζ_0) × ··· × Z_q[X]/(X − ζ_255),

where the ζ_i denote all the primitive 512-th roots of unity in Z_q. For ntruhrss701 and ntruhps2048677 we compute

Z_q[X]/(X^1536 − 1) → Z_q[X]/(X^3 − ζ_0) × ··· × Z_q[X]/(X^3 − ζ_511),

where the ζ_i denote all the 512-th roots of unity. For ntruhps2048509 we compute

Z_q[X]/(X^1024 − 1) → Z_q[X]/(X^2 − ζ_0) × ··· × Z_q[X]/(X^2 − ζ_511),

with the ζ_i again ranging over all the 512-th roots of unity. Then, for ntruhps4096821 we compute

Z_q[X]/(X^1728 − 1) → Z_q[X]/(X^3 − ζ_0) × ··· × Z_q[X]/(X^3 − ζ_575),

where the ζ_i denote all the 576-th roots of unity. Finally, for LAC, we compute

Z_q[X]/(X^512 + 1) → Z_q[X]/(X^2 − ζ_0) × ··· × Z_q[X]/(X^2 − ζ_255) and
Z_q[X]/(X^1024 + 1) → Z_q[X]/(X^4 − ζ_0) × ··· × Z_q[X]/(X^4 − ζ_255),

where the ζ_i denote all the primitive 512-th roots of unity.
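That Z_q contains the required roots of unity can be checked directly. A short Python sketch (our own, not the paper's code) finds a primitive 512-th root of unity ζ modulo 7681 and confirms ζ^256 = −1, which is exactly what the complete splitting of the negacyclic modulus X^256 + 1 into linear factors requires:

```python
def primitive_root_of_unity(q, order):
    # Find an element of exact multiplicative order `order` in Z_q*.
    # Here `order` is a power of two dividing q - 1 (7680 = 15 * 512).
    assert (q - 1) % order == 0
    for g in range(2, q):
        z = pow(g, (q - 1) // order, q)   # z^order = 1 by construction
        if pow(z, order // 2, q) != 1:    # excludes all smaller 2-power orders
            return z

zeta = primitive_root_of_unity(7681, 512)
# zeta^256 is a square root of 1 different from 1, hence -1 mod 7681,
# so zeta is a root of X^256 + 1 and the odd powers zeta^(2i+1) give
# the 256 evaluation points of the complete negacyclic NTT.
```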

Register allocation
Intel's Skylake and later microarchitectures have a throughput of 2 vector multiplications per clock cycle with a latency of 5 cycles [Fog20]. The addition and subtraction instructions have a throughput of 3 instructions per cycle, since they can also go to a third execution port that cannot execute multiplications; their latency is 1 cycle. Hence, the subtraction instruction in Algorithm 15 ideally does not compete with the multiplication instructions for execution resources, and the maximum theoretical throughput is 2/3 vector mulmod operations per cycle, or 32/3 scalar modular multiplications per cycle. On the other hand, the critical path of a vector mulmod consists of two multiplication instructions and a subtraction and thus has a latency of 11 cycles. For the code not to be completely latency-bound and to get near the maximum throughput, it is important that there are always many independent mulmods that can be computed in parallel. In principle, the out-of-order execution capability allows the CPU to find independent mulmods. But in practice the code will not be served from the small uop cache, and instruction fetch from the L1 instruction cache is limited to 16 bytes per cycle, which translates to less than about three vector instructions per cycle on average. So the code is likely to bottleneck on the front-end of the pipeline, and instruction decoding will not be able to run sufficiently far ahead for the CPU to find independent instructions if they are far apart in the code. Hence it is important to schedule the instructions so that as many mulmods as possible are as close together as possible. We achieve this by filling as many vector registers as possible with polynomial coefficients to operate on, under the constraint that we also need auxiliary registers for constants and scratch registers for intermediate results.
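Algorithm 15 is given elsewhere in the paper; the following scalar Python model (our own sketch, not the vectorized code) mirrors the structure of such a signed Montgomery mulmod with a precomputed constant b' = b · q^{-1} mod 2^16, and makes visible why the critical path is mullo → mulhi → sub (5 + 5 + 1 = 11 cycles) while the other mulhi can run in parallel:

```python
Q = 7681
QINV = pow(Q, -1, 1 << 16)  # q^-1 mod 2^16

def _i16(x):
    # wrap to a signed 16-bit value, like one lane of a vector register
    x &= 0xffff
    return x - (1 << 16) if x >= (1 << 15) else x

def mulhi(a, b):
    # models vpmulhw: high 16 bits of the full signed 32-bit product
    return (a * b) >> 16

def mulmod(a, b):
    # Signed Montgomery mulmod: returns r = a*b*2^(-16) (mod Q) with
    # |r| < Q, assuming |b| < Q and |a| <= 2^15. bprime would be
    # precomputed alongside b in practice, so only two multiplications
    # lie on the critical path; mulhi(a, b) is independent of them.
    bprime = _i16(b * QINV)
    m = _i16(a * bprime)            # models vpmullw
    return mulhi(a, b) - mulhi(m, Q)
```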
Then we can compute several NTT layers while loading the coefficients only once and, after only a few layers, arrive at polynomials that fit completely into the registers.
We also experimented with more refined approaches to scheduling, where we implemented several parallel mulmods in an interleaved fashion so that we could schedule the addition and subtraction instructions in a way that they do not steal execution resources from the multiplication instructions. The downside of this approach is that interleaving mulmod operations requires more scratch registers, so one can either only operate on fewer polynomial coefficients at a time or needs to temporarily store away some of the coefficients. In the end we found that not doing this and letting the register-renaming capability of the CPU take care of allocating scratch registers from the register file leads to superior results. In the two-power NTTs we always have 8 vector registers with a total of 128 polynomial coefficients loaded, whereas in the NTT for NTRU, whose length 1536 is divisible by 3, we always have 12 registers with 192 coefficients loaded.

Range analysis
For the two primes q = 7681 and q = 10753 that we use, it is not possible to compute all the layers of the NTT using straightforward radix-2 steps without performing additional modular reductions. We assume that the input polynomials we want to transform have coefficients less than 4096 in absolute value; this is true for all our applications, even without first reducing the polynomials modulo q. Now, by [Sei18, Lemma 2], the output coefficients of Algorithm 15 lie in the interval [−q, q]. So, using this approximation, we find for the forward negacyclic NTT with Cooley-Tukey butterflies that the coefficients grow by at most q in absolute value in each layer of the NTT. It then follows that we can only perform 2 layers without additional reductions. Instead, we use a more refined range analysis where, for each layer and a given input range, we compute the maximum range of the modular products. This then determines the range of the output coefficients, which form the inputs for the next layer. With this analysis we find that we can compute three layers of radix-2 splittings without additional reductions, both in the cyclic and in the negacyclic NTT. After these three layers we twist all the factors into rings of the form Z_q[X]/(X^n − 1). The advantage of twisting the factors instead of merely reducing coefficients is that this results in fewer modular multiplications in subsequent layers. Moreover, the mulmods as in Algorithm 15 are even slightly more efficient than, for example, Barrett reductions, as they have the same throughput but shorter dependency chains. Concretely, splitting rings of the form Z_q[X]/(X^n − 1) does not need any mulmod, but for later factors of this form we do in fact sometimes multiply coefficients by 1 in order to reduce them.
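The coarse version of this analysis can be sketched in a few lines (our own illustration): each Cooley-Tukey layer adds a product lying in (−q, q) to a coefficient, so the coefficient bound grows by at most q per layer, and the bound must stay below the signed 16-bit limit 2^15:

```python
def layers_without_reduction(q, input_bound=4096, word=1 << 15):
    # Coarse range analysis: after each Cooley-Tukey layer the bound
    # grows by q, since the twiddle products lie in (-q, q) [Sei18].
    # Count how many layers fit before a 16-bit lane could overflow.
    layers = 0
    bound = input_bound
    while bound + q < word:
        bound += q
        layers += 1
    return layers
```

For the larger prime 10753 this coarse bound allows only two layers; the refined per-layer analysis of the actual product ranges is what permits three layers for both primes.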
We then recursively compute the following maps with 16n mulmods, where ζ ∈ Z_q is a primitive 8-th root of unity:

Z_q[X]/(X^{8n} − 1) → Z_q[X]/(X^n − 1) × ··· × Z_q[X]/(X^n − 1).

Table 5: Saber performance results in clock cycles for core arithmetic operations on Cortex-M4 and Skylake. The inner-product computation in our AVX2 implementation for Saber does not contain the cost of computing the NTT of one of the input vectors: in encryption, the NTT of the secret vector is already computed for the matrix-vector product, and for decryption the secret vector can be stored in NTT form in the secret key, which does not need to be compatible with other implementations.
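To make the claim that splitting Z_q[X]/(X^n − 1) needs no mulmod concrete, here is a small Python sketch (ours, not the paper's code) of one radix-2 split of Z_q[X]/(X^{2n} − 1), together with its CRT inverse used only to check the round trip:

```python
Q = 7681

def split_cyclic(f, n):
    # Z_Q[X]/(X^2n - 1) -> Z_Q[X]/(X^n - 1) x Z_Q[X]/(X^n + 1).
    # Substituting X^n = 1 resp. X^n = -1 needs only additions and
    # subtractions; no twiddle multiplications (mulmods) are involved.
    lo, hi = f[:n], f[n:]
    plus = [(a + b) % Q for a, b in zip(lo, hi)]
    minus = [(a - b) % Q for a, b in zip(lo, hi)]
    return plus, minus

def join_cyclic(plus, minus):
    # CRT inverse of split_cyclic, only used here to verify the split.
    inv2 = (Q + 1) // 2  # 2^-1 mod Q
    lo = [(a + b) * inv2 % Q for a, b in zip(plus, minus)]
    hi = [(a - b) * inv2 % Q for a, b in zip(plus, minus)]
    return lo + hi
```

Splitting X^n + ζ^i factors, by contrast, does cost mulmods, which is why twisting everything back to the form X^n − 1 pays off over several recursion levels.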

Results
In this section, we describe the benchmarking results for our Saber, NTRU, and LAC implementations. First, we describe our benchmarking setup for the Cortex-M4 and Skylake and then we report our results for Saber, NTRU, and LAC in Sections 5.1, 5.2, and 5.3.
Benchmarking setup for the Cortex-M4. Our benchmarking setup is based on the pqm4 [KRSS] benchmarking framework and as such produces cycle counts comparable to previous work [BMKV20,KRS19]. We target the STM32F407-DISCOVERY board with an STM32F407VG microcontroller. We clock it at 24 MHz with no flash wait states to obtain cycle counts similar to the ones reported in pqm4. For obtaining randomness, we use the hardware random number generator. As both NTRU and Saber make use of SHA-3 and SHAKE, we use the optimized assembly implementations of Keccak from the XKCP, which are also contained in pqm4. LAC relies on AES and SHA-2, which we source from [SS17] and SUPERCOP, respectively. All cycle counts in the following were obtained for implementations compiled with gcc and -O3 (arm-none-eabi-gcc, version 10.2.0).
Benchmarking setup for Skylake. The cycle counts for AVX2 were obtained on an Intel Core i7-6600U (Skylake) processor with a base frequency of 2.6 GHz. As usual, we disable TurboBoost and hyperthreading. We compile our implementations with gcc version 7.5.0 and use the compiler flags -O3, -fomit-frame-pointer, -march=native, -mtune=native. All cycle counts are the median cycle counts of 10 000 executions.

Table 5 contains the performance results for the polynomial arithmetic speed-ups in Saber. We report the results for the matrix-vector multiplication A · s as used in key generation and encryption and the vector-vector inner product b^T · s as used in encryption and decryption separately. The dimension of the matrix is l × l and the dimension of the vectors is l × 1. The dimensions l = 2, 3, 4 correspond to the parameter sets Lightsaber, Saber, and Firesaber.

Saber results
On Cortex-M4, we obtain cost reductions between 58% and 61% for A · s and between 42% and 44% for b^T · s. The cost reductions on Skylake range from 25% to 39% for A · s and from 47% to 60% for b^T · s. Table 6 illustrates the resulting performance of Lightsaber, Saber, and Firesaber on the Cortex-M4 when our fast MatrixVectorMul and InnerProduct are plugged in. In addition to the full CCA-secure KEM schemes, we also report cycle counts for the underlying CPA-secure PKE. While those are not explicitly exposed in the Saber specification, all our optimizations were inside the CPA primitives and, hence, the overhead of the CCA transformation did not change. Moreover, some schemes use considerably more expensive CCA transforms than others. For example, Saber and Kyber include very costly public-key and ciphertext hashes in their CCA transforms that could be omitted with a different choice of transform.
On Cortex-M4, we achieve significant cost reductions of consistently more than 20%. For CPA-secure decryption, we get the most notable cost reduction of 38%.

NTRU results
Table 7 shows the results for polynomial multiplication in NTRU for the four different polynomial degrees used in ntruhps2048509, ntruhps2048677, ntruhrss701, and ntruhps4096821. On the Cortex-M4, for the smallest polynomial size n = 509, our implementation using NTTs performs only slightly better than the Toom-4 implementation [KRS19]. For the larger sizes, the cost reduction on the Cortex-M4 is more pronounced at 10% or more. On AVX2, n = 509 is the only polynomial size for which we were not able to obtain a speed-up using NTTs. All other parameter sets see small cost reductions of 7% to 15%. The reason why we did not achieve a speed-up for n = 509 is partly that we chose a different vector layout and shuffling strategy in the length-1024 NTT compared to the other NTTs. The advantage of this vector layout is that the constant vectors are easier to precompute and need less space, but it requires more loads. In principle, the loads do not compete with the arithmetic because they go to separate execution ports and can be dispatched in parallel. Unfortunately, it turned out that they do incur a penalty, most likely because the code is bottlenecking on the front-end. We leave it as future work to further optimize the length-1024 NTT as well as the other NTTs.

Table 8 reports the results for the full NTRU schemes. As we only optimize polynomial multiplication in this paper and key generation is dominated by polynomial inversion, we do not see a big difference in key-generation cycle counts across all parameter sets and platforms. On the Cortex-M4, encapsulation is 1% to 6% faster, while decapsulation is 2% to 4% faster. For the underlying CPA-secure PKE, we achieve higher speed-ups with 2% to 13% fewer cycles, which comes as no surprise as we did not modify the CCA transformation.
LAC results
Table 9 summarizes the speed of the (big-by-small) polynomial multiplication in LAC. Our code is faster than that of [LLZ+18] by a factor of 10 on the Cortex-M4 and by factors of 3 to 7 on Skylake.