Polynomial Multiplication in NTRU Prime - Comparison of Optimization Strategies on Cortex-M4.

This paper proposes two different methods to perform NTT-based polynomial multiplication in polynomial rings that do not naturally support such a multiplication. We demonstrate these methods on the NTRU Prime key-encapsulation mechanism (KEM) proposed by Bernstein, Chuengsatiansup, Lange, and van Vredendaal, which uses a polynomial ring that is, by design, not amenable to use with NTT. One of our approaches uses Good's trick and focuses on speed and on supporting more than one parameter set with a single implementation. The other approach uses a mixed-radix NTT and focuses on the use of smaller multipliers and less memory. On an ARM Cortex-M4 microcontroller, we show that our three NTT-based implementations, one based on Good's trick and two mixed-radix NTTs, provide between 17% and 32% faster polynomial multiplication. For the parameter set ntrulpr761, this results in between 9% and 16% faster total operations (sum of key generation, encapsulation, and decapsulation) and requires between 15% and 39% less memory than the current state-of-the-art NTRU Prime implementation on this platform, which uses Toom-Cook-based polynomial multiplication.


Introduction
Due to the ongoing advances in quantum computing, the threat that quantum computers pose to IT security becomes more and more imminent: Experts predict that sufficiently large and stable quantum computers running Shor's algorithm for factorization and for solving discrete logarithms may be able to break currently widespread asymmetric cryptographic primitives in the next ten to fifteen years. Therefore, in the research field of Post-Quantum Cryptography (PQC), researchers have been investigating alternative cryptographic schemes that are believed to be secure against attacks aided by quantum computers.
PQC primitives based on lattice problems have attracted significant attention due to their efficient implementations, which often are on par with or even better than current cryptographic schemes. This attention is reflected in the NIST post-quantum cryptography standardization process, as nearly half of the candidates use hard lattice problems as their building blocks. Among the lattice-based NIST candidates, the key encapsulation mechanism (KEM) NTRU Prime [BCLvV17], which has advanced to the third round as an alternate candidate, differentiates itself by its choice of the polynomial ring.
Related Work. There has been prior work on improving the efficiency of polynomial multiplication in lattice-based schemes in general and for NTRU and NTRU-related schemes like NTRU Prime specifically. There has also been some work on implementing NTRU Prime for embedded devices.
For example, in [MKV20] Mera et al. investigate the use of Toom-Cook multiplication to speed up lattice-based cryptography, specifically the KEM scheme Saber with polynomials of degree 256. They report performance numbers for AVX and for ARM Cortex-M4 with assembler optimizations.
Hülsing et al. provided an efficient implementation of a variant of NTRU for the AVX2 vector instruction set [HRSS17]. They are using several recursive levels of Karatsuba for polynomial multiplication. The work by Lyubashevsky and Seiler [LS19] is using an NTT to achieve a fast implementation of an NTRU variant.
NTRUEncrypt has been optimized, for example, for the AVX2 vector instruction set by Dai et al. in [DWZ18] using a combination of Karatsuba and Toom-Cook for polynomial multiplication with sparse index-based multiplication for the final polynomials of degree smaller than 32. An implementation of NTRUEncrypt for embedded systems on an 8-bit AVR microcontroller is provided by Cheng et al. in [CGRR19] using optimization techniques for sparse polynomial multiplication.
There is an implementation of NTRU Prime for the Haswell x86 architecture using AVX2 vector instructions by Bernstein, which uses Good's trick and the Chinese Remainder Theorem (CRT) to enable the use of the NTT for the design and the parameters of NTRU Prime. We will take a closer look at this approach in Section 3. Cheng et al. provide an efficient implementation of NTRU Prime on an ATmega1284 8-bit AVR microcontroller [CDG+19] using Karatsuba-based polynomial multiplication and efficient modular reduction. Kannwischer et al. performed measurements of the C reference implementations of several PQC schemes on a Cortex-M4 platform (without any optimizations) [KRSS19]. They report performance numbers for NTRU Prime as well.
However, the current state-of-the-art implementation of NTRU Prime on Cortex-M4 is the work by Yang et al. that was included as the optimized implementation for NTRU Prime in the pqm4 project in April 2020. It uses Toom-Cook for polynomial multiplication, fast modular inversion from [BY19], and platform-specific, hand-written assembly optimization. It is up to two orders of magnitude faster than the C reference code reported in [KRSS19]. This work is the basis of our optimizations for polynomial multiplication. Performance values are listed in Table 6 for comparison to our improvements.
Our Contributions. We present, evaluate, and compare two methods to implement NTT-based polynomial multiplication in $\mathbb{Z}_{4591}[X]/(X^{761} - X - 1)$, accompanied by three implementations. We use the following two methods, one being more generic and the other more parameter-specific:
• Good's Trick - We implemented an NTT for $\mathbb{Z}_{q'}[X]/(X^{N_1} - 1)$, where $N_1 = p' \cdot 2^k \ge 2p$, $p'$ is a small prime, and $q'$ is selected to ensure that there will be no modular reduction in the coefficients of the resulting polynomial.
• Mixed Radix NTT - We implemented an NTT for $\mathbb{Z}_q[X]/(X^{N_2} - 1)$, where $N_2 = t \cdot 2^k \cdot 3^l \cdot 5^m \cdot 7^n \ge 2p$, $t$ is a small integer, and $q \equiv 1 \pmod{N_2/t}$, as well as an NTT for $\mathbb{Z}_q[X]/(X^{1530} - 1)$, where 1530 is the smallest divisor of $q - 1$ that is larger than $2p$.
The NTRU Prime submission has two additional parameter sets, which use $q = 4621$ and $q = 5167$, where $q - 1$ can be factored as $2^2 \cdot 3 \cdot 5 \cdot 7 \cdot 11$ and $2 \cdot 3^2 \cdot 7 \cdot 41$, respectively. Thus the techniques described in this paper can be applied to all parameter sets in the submission. Although the techniques are not new, we present their first implementation in lattice-based cryptography. Thus, we focus on implementation issues instead of implementing all parameter sets of NTRU Prime. Our implementations are publicly available under an open source license at https://github.com/vincentvbh/NTRUPrime-PolyMul.
Structure of this Paper. Section 2 provides some background information on NTRU Prime and on using NTT for polynomial multiplication. In Section 3 we introduce our approaches for the implementation of polynomial multiplication using NTT for odd sizes including Good's trick. Section 4 describes the implementation of our two approaches for improving polynomial multiplication. Section 5 provides an evaluation of our work and a comparison of our improvements with prior art. Finally, Section 6 concludes our work.

Preliminaries
In this section, we recall the NTRU Prime key encapsulation scheme and we provide an overview of the number theoretic transform when used for polynomial multiplication.

NTRU Prime
The authors of NTRU Prime [BCLvV17] propose "an efficient implementation of high-security prime-degree large-Galois-group inert-modulus ideal-lattice-based cryptography." NTRU Prime tweaks the classic NTRU scheme to use rings without exploiting special structures of the rings. The NTRU Prime submission to the NIST standardization process [BCLvV19] provides two KEM schemes: Streamlined NTRU Prime and NTRU LPRime. Both schemes share common notations and definitions for the parameters and theorems. The parameters are a prime number $p \ge 17$, a prime number $q$, and a positive integer $w$.
If there are exactly $w$ coefficients that are nonzero, then the weight of the element is $w$. We define the set of small weight-$w$ elements of the ring $\mathbb{Z}[x]/(x^p - x - 1)$ as Short. The set Rounded is defined as the set of polynomials with each coefficient in $\{-(q-1)/2, \ldots, -6, -3, 0, 3, 6, \ldots, (q-1)/2\}$ for $q \in 1 + 3\mathbb{Z}$ or in $\{-(q+1)/2, \ldots, -6, -3, 0, 3, 6, \ldots, (q+1)/2\}$ for $q \in 2 + 3\mathbb{Z}$. Please note that we will abbreviate the rings $\mathbb{Z}[x]/(x^p - x - 1)$, $(\mathbb{Z}/3)[x]/(x^p - x - 1)$, and $(\mathbb{Z}/q)[x]/(x^p - x - 1)$ as $R$, $R/3$, and $R/q$, respectively.
NTRU Prime defines two deterministic algorithms called HashConfirm and HashSession that use a function Hash. Hash(z) returns the first 32 bytes of SHA-512(z), and $\mathrm{Hash}_b(z)$ is defined as Hash(b, z), prefixing the input z with a one-byte value $b \in \{0, \ldots, 255\}$. HashConfirm(r, h) is defined as $\mathrm{Hash}_2(\mathrm{Hash}_3(r), \mathrm{Hash}_4(h))$ for $r \in$ Short in Streamlined NTRU Prime and $r \in \{0, 1\}^I$ in NTRU LPRime, where h is the public key and $I \in 8\mathbb{Z}^{+}$. The algorithm HashSession(b, r, C) is defined as $\mathrm{Hash}_b(\mathrm{Hash}_3(r), C)$ for $b \in \{0, 1\}$, r as above, and C the ciphertext of the respective scheme.
Theorem 2 ([BCLvV17, Theorem 2]). Let $p \ge 3$ and $w \ge 1$ be fixed integers. Let $r, g \in \mathbb{Z}[x]$ be polynomials of degree at most $p - 1$ with each coefficient in $\{-1, 0, 1\}$. Assume that $r$ has at most $w$ nonzero coefficients. Then $gr \bmod x^p - x - 1$ has each coefficient in the interval $[-2w, 2w]$.
Theorem 3 ([BCLvV17, Theorem 3]). Let $p \ge 3$ and $w \ge 1$ be fixed integers. Let $m, r, f, g \in \mathbb{Z}[x]$ be polynomials of degree at most $p - 1$ with each coefficient in $\{-1, 0, 1\}$. Assume that $f$ and $r$ each have at most $w$ nonzero coefficients. Then $3fm + gr \bmod x^p - x - 1$ has each coefficient in the interval $[-8w, 8w]$.
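The bound of Theorem 3 can be checked numerically. The following is a minimal sketch, not the authors' code: it multiplies in $\mathbb{Z}[x]/(x^p - x - 1)$ by schoolbook multiplication followed by the substitution $x^p = x + 1$. The helper names are hypothetical, and the value w = 286 used in the usage note is an assumption taken from the sntrup761 parameter set rather than from the text above.

```python
import random

def mul_mod_xp_x_1(f, g, p):
    """Product of f and g in Z[x]/(x^p - x - 1), coefficients as Python ints."""
    h = [0] * (2 * p - 1)
    for i, a in enumerate(f):
        if a:  # skipping zeros pays off when f is sparse
            for j, b in enumerate(g):
                h[i + j] += a * b
    # x^i = x^(i-p) * x^p = x^(i-p+1) + x^(i-p), applied from the top down
    for i in range(2 * p - 2, p - 1, -1):
        h[i - p] += h[i]
        h[i - p + 1] += h[i]
    return h[:p]

def random_small(p, rng):
    """Polynomial with every coefficient in {-1, 0, 1}."""
    return [rng.choice((-1, 0, 1)) for _ in range(p)]

def random_short(p, w, rng):
    """Polynomial with exactly w nonzero coefficients, each -1 or 1."""
    poly = [0] * p
    for i in rng.sample(range(p), w):
        poly[i] = rng.choice((-1, 1))
    return poly
```

With p = 761 and w = 286, sampling m, g with `random_small` and f, r with `random_short` and forming 3fm + gr coefficient-wise should never produce a coefficient outside [-8w, 8w].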

Streamlined NTRU Prime
Streamlined NTRU Prime (sntrup) has two layers:
• a perfectly correct deterministic PKE as inner layer and
• a perfectly correct KEM as outer layer.
The inner layer, Streamlined NTRU Prime Core, has parameters $(p, q, w)$, where $p$ and $q$ are prime numbers and $w$ is a positive integer such that $2p \ge 3w$, $q \ge 16w + 1$, and $x^p - x - 1$ is irreducible in the polynomial ring $(\mathbb{Z}/q)[x]$. The parameter sets of Streamlined NTRU Prime are listed in Table 1. The algorithms for key generation, encapsulation, and decapsulation of Streamlined NTRU Prime are shown in Algorithms 1, 2, and 3.
Switching the ring of an element. Decapsulation in Streamlined NTRU Prime needs to map polynomials between $R/q$ and $R/3$, as shown in lines 2 and 4 of Algorithm 3. While MaptoR/3 performs $c_j = (a_j \bmod^{\pm} q) \bmod^{\pm} 3$ for each coefficient, MaptoR/q performs $c_j = (a_j \bmod^{\pm} 3) \bmod^{\pm} q$, which simply changes the ring of the arithmetic operations without changing the signed representation of the coefficients.
Streamlined NTRU Prime utilizes the Fujisaki-Okamoto (FO) transformation [FO13] to construct a CCA-secure KEM. This transformation involves re-encryption of the decrypted message to check whether the ciphertext was correctly generated using the encryption algorithm. This re-encryption can be seen in lines 5-7 of Algorithm 3. The comparison with the original ciphertext is performed in line 8 of Algorithm 3.
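The two ring-switching maps can be illustrated with a signed (centered) modular reduction. This is a small sketch with hypothetical function names; it assumes that $\bmod^{\pm}$ denotes the representative in $[-(m-1)/2, (m-1)/2]$ for an odd modulus $m$.

```python
def mod_pm(a, m):
    """Centered representative of a modulo an odd m, in [-(m-1)/2, (m-1)/2]."""
    r = a % m                      # Python's % is non-negative for m > 0
    return r - m if r > (m - 1) // 2 else r

def map_to_r3(poly, q=4591):
    """c_j = (a_j mod+- q) mod+- 3 for each coefficient."""
    return [mod_pm(mod_pm(a, q), 3) for a in poly]

def map_to_rq(poly, q=4591):
    """c_j = (a_j mod+- 3) mod+- q; small coefficients keep their signed value."""
    return [mod_pm(mod_pm(a, 3), q) for a in poly]
```

For coefficients already in {-1, 0, 1}, `map_to_rq` returns them unchanged, which matches the remark that only the ring of the subsequent arithmetic changes.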
Encoding and decoding bit strings to polynomials. Encapsulation and decapsulation of NTRU LPRime need to encode bit strings to polynomials. The operation Encode is called in line 7 of Algorithm 5 and line 8 of Algorithm 6, while Decode is used only in line 2 of Algorithm 6. The Encode function encodes an $I$-bit string $r = (r_0, r_1, \ldots, r_{I-1})$ to a polynomial $bA$. The Decode function generates a bit string from $aB$ and $T$.
The NTRU LPRime scheme also uses an FO transformation for CCA security. The re-encryption stage of this transformation can be seen in lines 3-9 of Algorithm 6.

Number Theoretic Transform
As mentioned before, one popular method for implementing polynomial multiplication is to apply a number theoretic transform (NTT) and point-wise multiplication. This approach is very attractive because it has quasi-linear complexity. The NTT $\hat{x}$ of a vector $x \in \mathbb{Z}_q^N$ is defined as
$$\hat{x}_k = \sum_{j=0}^{N-1} x_j \psi^{jk}, \quad 0 \le k < N,$$
for an $N$-th root of unity $\psi$ in $\mathbb{Z}_q$. Since this requires that an $N$-th root of unity exists in $\mathbb{Z}_q$, $q$ is called an "NTT-friendly prime" if $\mathbb{Z}_q$ has an $N$-th root of unity. The definition means that a size-$N$ NTT operation is equivalent to a matrix multiplication with an $N \times N$ matrix $A$ that consists of the coefficients $a_{i,j} = \psi^{(i-1)(j-1)}$. A naive implementation, however, will result in $O(N^2)$ complexity for the operation and in no advantage over other multiplication routines. A well-known divide-and-conquer strategy exists for cases in which $N$ is not prime [CT65].
In such cases, the NTT operation can be realized by combining the results of $N/p$ smaller NTT operations on vectors of size $p$. These smallest NTT operations over vectors of prime size are referred to as butterflies in the literature, due to the "butterfly shape" of diagrams mapping the signal flow in such operations. Because of this structure, the immediate output $\hat{x}_k$ of the algorithm appears in an order different from that of the input. In the popular case of a transform that only comprises radix-2 stages, the output is in bit-reversed order compared to the input order (see Figure 1 for an example).
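To make the definition concrete, here is a minimal sketch of the naive $O(N^2)$ transform and of cyclic polynomial multiplication through point-wise products. The size N = 30 is an arbitrary demo choice (any divisor of 4590 would do), the function names are hypothetical, and the code needs Python 3.8 or later for modular inverses via `pow`.

```python
Q = 4591   # NTRU Prime modulus for the 761 parameter sets
N = 30     # demo size (an assumption); any N dividing Q - 1 = 4590 works

def prime_factors(n):
    out, d = [], 2
    while d * d <= n:
        if n % d == 0:
            out.append(d)
            while n % d == 0:
                n //= d
        d += 1
    if n > 1:
        out.append(n)
    return out

def root_of_unity(n, q):
    """Return an element of multiplicative order exactly n in Z_q (q prime)."""
    for a in range(2, q):
        psi = pow(a, (q - 1) // n, q)          # order of psi divides n
        if all(pow(psi, n // r, q) != 1 for r in prime_factors(n)):
            return psi
    raise ValueError("no root of order n")

def ntt(x, psi, q):
    n = len(x)
    return [sum(x[j] * pow(psi, j * k, q) for j in range(n)) % q for k in range(n)]

def intt(xh, psi, q):
    n_inv = pow(len(xh), -1, q)
    return [v * n_inv % q for v in ntt(xh, pow(psi, -1, q), q)]

def cyclic_mul_ntt(f, g, psi, q):
    """f * g in Z_q[x]/(x^n - 1) via forward NTTs, point-wise products, inverse NTT."""
    fh, gh = ntt(f, psi, q), ntt(g, psi, q)
    return intt([a * b % q for a, b in zip(fh, gh)], psi, q)

def cyclic_mul_schoolbook(f, g, q):
    n = len(f)
    h = [0] * n
    for i in range(n):
        for j in range(n):
            h[(i + j) % n] = (h[(i + j) % n] + f[i] * g[j]) % q
    return h
```

Both multiplication routines should agree on any pair of length-N inputs; the fast algorithms discussed next only change how `ntt` is evaluated, not what it computes.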
For the general case, one can define an index calculation function $R_{p_1,\ldots,p_n}$ for an NTT using $n$ layers with radix $p_i$ on layer $1 \le i \le n$ in a recursive manner, starting from $R_p(k) = k$ for an index $k$. This can be used to express the output order of an NTT, for example the "digit-reversed" index permutation $\mathrm{dr}_{270}$ of a 270-NTT that applies one radix-2, three radix-3, and finally one radix-5 stage.
For the application of the NTT, it is practical to reorder the inputs of the transformation in order to attain an output in normal order. If the transformation is used for polynomial multiplication, the order of the output is irrelevant and the normal input order can be used. In this case, the index permutation can be incorporated into the inverse transform instead.
For arithmetic in $\mathbb{Z}_q$, the possible input sizes are determined by the prime factors of $q - 1$, as roots of unity of order exactly $n$ exist only if $n$ divides $q - 1$. This also determines the radix-$p$ stages that are applied when performing a given transform, but not the order in which they are applied. Although the NTT algorithm can work for any size, it can use recursive structures when the size is a highly composite number, e.g., a power of a small prime. Below, we describe two tricks to implement an NTT more efficiently when the size has a special form that is not a power of a prime.
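As a small illustration of how $q - 1$ restricts the transform size, the sketch below lists the divisors of $q - 1$ that are large enough to hold a full product of two degree-$(p-1)$ polynomials without wrap-around. The function names are hypothetical.

```python
def divisors(n):
    """All positive divisors of n in increasing order (fine for small n)."""
    return [d for d in range(1, n + 1) if n % d == 0]

def complete_ntt_sizes(q, p):
    """Sizes N with N | q - 1 and N >= 2p - 1, i.e. room for a degree-(2p-2) product."""
    return [d for d in divisors(q - 1) if d >= 2 * p - 1]

sizes_761 = complete_ntt_sizes(4591, 761)   # candidates for p = 761, q = 4591
```

For p = 761 and q = 4591, the smallest entry is 1530, the size used for one of the mixed-radix implementations.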

Rader's Trick
In [Rad68], Rader proposed a method to compute a prime-size NTT for a prime $p$. The method transforms the multiplication with the twiddle factors into a polynomial multiplication of size $p - 1$. The first observation of [Rad68] is that the first coefficient $\hat{x}_0$ of the NTT can be computed as the sum of the coefficients, $\hat{x}_0 = \sum_{i=0}^{p-1} x_i$. The second observation is that $x_0$ is always multiplied with 1 during the calculation of the other indices. Thus, the calculation of the other indices takes the form
$$\hat{x}_k = x_0 + \sum_{i=1}^{p-1} x_i \psi^{ik}, \quad 1 \le k \le p - 1. \qquad (1)$$
After moving $x_0$ to the left-hand side, the sum in Equation (1) becomes a multiplication of $p - 1$ pairs of coefficients in $\mathbb{Z}_q$, but the order of the coefficients used would not form a polynomial multiplication as desired. Rader proposed a permutation to transform the sum into a polynomial multiplication modulo $x^{p-1} - 1$. The permutation uses the fact that there is a number $g$ in $[1, p - 1]$ whose powers $g^i \bmod p$ form a bijection from $[1, p - 1]$ to $[1, p - 1]$. Using this, the index calculation function can be expressed as $k = g^i \bmod p$, which turns the sum in Equation (1) into a cyclic convolution of length $p - 1$.
This technique is useful, especially for the implementation of a mixed-radix NTT, because the butterfly operations are basically small prime-size NTTs, with the exception of the radix-2 butterfly, where $p - 1 = 1$ and the required operation becomes a simple integer multiplication. Table 3 shows an example of this index permutation for $p = 17$ and $g = 3$.
i: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
j: 6 2 12 4 7 8 14 16 11 15 5 13 10 9 3 1
Table 4: Good's permutation for size 12.
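The following sketch shows one way to realize Rader's re-indexing for $p = 17$ over $\mathbb{Z}_{4591}$ and checks it against the direct definition. The substitution used here, input index $g^i$ and output index $g^{-j}$, is one common convention and an assumption on our part; it reproduces the row $j = 3^{-i} \bmod 17$ of Table 3. Function names are hypothetical.

```python
Q, P, G = 4591, 17, 3          # 17 divides Q - 1 = 4590; 3 generates (Z/17)^*

def naive_ntt(x, psi, q):
    n = len(x)
    return [sum(x[j] * pow(psi, j * k, q) for j in range(n)) % q for k in range(n)]

def rader_ntt(x, psi, q, g):
    """Size-p NTT (p prime) computed through a cyclic convolution of length p - 1."""
    p = len(x)
    g_inv = pow(g, -1, p)
    a = [x[pow(g, i, p)] for i in range(p - 1)]                   # permuted input
    b = [pow(psi, pow(g_inv, i, p), q) for i in range(p - 1)]     # permuted twiddles
    # c[j] = sum_i a[i] * b[(j - i) mod (p - 1)] = sum_i x[g^i] * psi^(g^(i - j))
    c = [sum(a[i] * b[(j - i) % (p - 1)] for i in range(p - 1)) % q
         for j in range(p - 1)]
    out = [0] * p
    out[0] = sum(x) % q                                           # first observation
    for j in range(p - 1):
        out[pow(g_inv, j, p)] = (x[0] + c[j]) % q                 # output index g^(-j)
    return out

# element of order 17 in Z_4591 (17 is prime, so any value other than 1 works)
PSI17 = next(v for v in (pow(a, (Q - 1) // P, Q) for a in range(2, Q)) if v != 1)
TABLE3_J = [pow(pow(G, -1, P), i, P) for i in range(1, P)]
```

In an optimized implementation the length-16 convolution `c` would itself be computed with a fast method; the quadratic loop here only demonstrates that the permuted sum is a convolution.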

Good's Trick
In [Goo51], Good proposed a method to perform a size-$(p_0 \cdot p_1^k)$ NTT as a combination of $p_0$ size-$p_1^k$ NTTs, where $p_0$ and $p_1$ are small prime numbers. This technique maps a polynomial in $x$ to a polynomial in two variables $y$ and $z$, which also requires a permutation of the coefficients of the input polynomial. Using the fact that $p_0$ and $p_1^k$ are relatively prime, the index calculation applies the CRT to obtain $x^i = y^{i_0} z^{i_1}$ with $i_0 = i \bmod p_0$ and $i_1 = i \bmod p_1^k$. As an example, the permutation of the indices for an input of size 12 is given in Table 4. We will use this trick for polynomials whose degree is less than half of the size of the polynomial multiplication. Using the above permutation after zero-padding of a polynomial of degree 5, the two-dimensional polynomial representation is
$$a_5 x^5 + a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0 = (a_2 z^2 + a_5 z) y^2 + (a_1 z + a_4) y + (a_3 z^3 + a_0).$$
We explain the internals of this method for selected parameters in Section 3.1.
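The index map for size 12 can be reproduced in a few lines. This is a minimal sketch with hypothetical function names; it assumes the CRT map $i \mapsto (i \bmod p_0, i \bmod p_1^k)$, which is the map consistent with the degree-5 example above.

```python
def goods_permutation(p0, n1):
    """Map each index i in [0, p0*n1) to the pair (i mod p0, i mod n1)."""
    return [(i % p0, i % n1) for i in range(p0 * n1)]

def to_two_dimensional(coeffs, p0, n1):
    """Place coefficient a_i of x^i at row i mod p0 (power of y), column i mod n1 (power of z)."""
    grid = [[0] * n1 for _ in range(p0)]
    for i, a in enumerate(coeffs):
        grid[i % p0][i % n1] = a
    return grid

# zero-padded degree-5 polynomial for size 12 = 3 * 4
grid = to_two_dimensional(["a0", "a1", "a2", "a3", "a4", "a5"], 3, 4)
```

Row 2 of `grid` holds the coefficients of $y^2$, namely $a_5$ at $z$ and $a_2$ at $z^2$, in agreement with the displayed identity. Because 3 and 4 are coprime, the twelve index pairs are all distinct.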

Approaches
In NIST's post-quantum cryptography mailing list, Bernstein shared cycle counts of an NTRU Prime implementation on the Haswell architecture. The software uses Good's trick with an additional CRT map as suggested in [Pol71]. The CRT map allows us to use a modulus for coefficient-wise operations that is a product of two or more smaller, more NTT-friendly primes. The implementation utilizes three multiplications in $\mathbb{Z}_{7681}[X]/(X^{512} - 1)$ and three multiplications in $\mathbb{Z}_{10753}[X]/(X^{512} - 1)$ together with Good's trick to perform multiplication in $\mathbb{Z}_{7681}[X]/(X^{1536} - 1)$ and $\mathbb{Z}_{10753}[X]/(X^{1536} - 1)$, respectively. Finally, the implementation computes a coefficient-wise CRT to perform the multiplication in $\mathbb{Z}_{82593793}[X]/(X^{1536} - 1)$, which is similar to the polynomial ring used in our implementation.
Bernstein's method requires us to perform two NTT-based polynomial multiplications, which is very suitable for the AVX2 vector extensions: The AVX2 extension has special instructions for performing 16 multiplications of 16-bit inputs, i.e., VPMULLW and VPMULHW, and Montgomery multiplication can be implemented very efficiently for 16 coefficients in parallel. However, AVX2 does not have similar instructions for 32-bit integers. This makes performing two polynomial multiplications with 16-bit base more efficient than one polynomial multiplication with 32-bit base.
Although the Cortex-M4 architecture also has some special instructions for 16-bit integers, most instructions operate on 32-bit integers. Hence, performing two polynomial multiplications with smaller moduli is not an efficient choice on a Cortex-M4 processor. Instead, we decided to use size 1536 NTTs, size 1620 NTTs, and size 1530 NTTs as described in the following.

Products in $\mathbb{Z}_q[x]/(x^{761} - x - 1)$ using Size 1536 NTT
For the NTRU Prime parameter sets sntrup761 and ntrulpr761, we have $p = 761$ and $q = 4591$. At first glance, it seems intuitive to use size $1536 = 2^9 \cdot 3$ NTTs for $p = 761$, because that is the nearest "nice" number suitable for the NTT. But $\mathbb{Z}_{4591}$ does not have roots of unity of order 1536. We can choose between the following options:
1. Interpose rings with roots of unity, e.g., of order 64 (Schönhage/Nussbaumer), or
2. Switch to an NTT-friendly ring $\mathbb{Z}_{q'}$, where $1536 \mid (q' - 1)$ and $q' > 2 \cdot 2295 \cdot 1521$, so that the products in $\mathbb{Z}_{q'}[x]$ and $\mathbb{Z}[x]$ coincide; we find $q' = 6984193 = 4547 \cdot 1536 + 1$, or
3. Use two or more NTT-friendly moduli (e.g., 7681 and 10753) whose product is larger than $2 \cdot 2295 \cdot (2 \cdot 761 - 1)$ as above, then assemble the results using the CRT (i.e., Bernstein's approach).
Option 1 replaces multiplications with moves and additions/subtractions, which is beneficial for platforms such as FPGAs, but not for the Cortex-M4 with its relatively cheap multiplications. Option 3 would require running the same size-1536 NTT multiple times, while Option 2 runs it only once but with larger operands. In general, on the Cortex-M4, using the larger operand is better as long as it is feasible. The reason is that while loads and additions/subtractions are up to twice as fast for the smaller operand size, the multiplications are not. Therefore, an NTT modulo 7681 or 10753 (16-bit operands) does not cost less than half of an NTT modulo 6984193 (32-bit operands).
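The arithmetic conditions behind Options 2 and 3 are easy to re-check. The sketch below uses hypothetical helper names and plain trial division; the bound $2 \cdot 2295 \cdot 1521$ is taken from the text.

```python
def is_prime(n):
    """Trial division; sufficient for moduli of this size."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def check_modulus(q_new, ntt_size, coeff_bound, terms):
    """Report the three conditions a replacement modulus has to satisfy."""
    return {
        "prime": is_prime(q_new),
        "has_root": (q_new - 1) % ntt_size == 0,          # root of order ntt_size can exist
        "no_wraparound": q_new > 2 * coeff_bound * terms,  # product equals the integer product
    }

option2 = check_modulus(6984193, 1536, 2295, 1521)
option3_product = 7681 * 10753      # combined modulus of the CRT approach
```

The same helper can be pointed at other candidate sizes or parameter sets to see whether a single 32-bit modulus is available.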
Good's trick. When using Option 2, we are performing multiplication in $\mathbb{Z}_q[x]/(x^{1536} - 1)$. There are three approaches to do an FFT multiplication of size $3 \cdot 2^k$:
1. Standard Cooley-Tukey FFTs, including one radix-3 stage,
2. Incomplete NTTs, splitting down to degree-2 polynomials, followed by a point multiplication stage modulo $x^3 - \psi^i$ for various powers of $\psi$, a $2^k$-th root of unity, and then a matching incomplete inverse NTT, or
3. Good's FFT trick [Goo51,Ber], where we set $x = yw$ with $y^{2^{k-1}} = -1 = w^2 + w$.
When applying Good's trick, each multiplicand is written as a polynomial in $y$ and $w$, with $y$-degree less than $2^k$ and $w$-degree less than 3. We may split it as $f_0(y) + w f_1(y) + w^2 f_2(y)$; the corresponding rearrangement of the coefficients may be referred to as Good's permutation. We follow with a size-$2^k$ FFT with respect to the variable $y$ on each multiplicand, represented by three parallel size-$2^k$ NTTs. Then we do "point" multiplication by multiplying together degree-2 polynomials in $w$ modulo $w^3 - 1$, do an inverse size-$2^k$ FFT (represented by three inverse NTTs), and then finally undo Good's permutation.
• In Good's trick, the size-$2^k$ NTTs have operands contiguous in memory (which may allow the use of multi-word memory access instructions); in an incomplete NTT, the size-$2^k$ NTTs have their coefficients spaced three indices apart.
• In Good's trick, point multiplications work with coefficients spaced $2^k$ slots apart in memory, but are modulo $w^3 - 1$. In an incomplete NTT, point multiplications have operands contiguous in memory, but are modulo $x^3 - \psi^i$ for different powers of $\psi$.
After applying Good's trick, we may also perform a two-dimensional FFT on both $y$ and $w$, which would be three-fold parallel size-$2^k$ NTTs followed by $2^k$ parallel size-3 NTTs. Then "point" multiplication would be just simple modular products, to be followed by an inverse two-dimensional FFT. This is better than Approach 1 in that, in a complete Cooley-Tukey FFT, after the initial radix-3 stage one would have to transform ("twist") two of the three polynomials of degree less than $2^k$ from modulo $x^{2^k} - \omega_3$ and $x^{2^k} - \omega_3^2$ to modulo $x^{2^k} - 1$ by multiplying each coefficient with a different root of unity for the same effect.
Note: Good's trick is a general statement about a product of coprime groups giving a tensor product of group rings. Given a root of unity of order $3 \cdot 2^k$, multiplication of degree-2 polynomials may be done using size-3 NTTs, but mostly we just need a root of order $2^k$. A significant advantage is that Good's permutation can usually be achieved "for free" by careful rearrangement of loops and index variables.
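The whole pipeline (permutation, three parallel NTTs, convolutions modulo $w^3 - 1$, inverse NTTs, inverse permutation) can be sketched compactly. This is an illustrative model with naive quadratic NTTs, not the optimized implementation; the function names are hypothetical, and the index map $i \mapsto (i \bmod 3, i \bmod 2^k)$ is assumed.

```python
def ntt(x, root, q):
    n = len(x)
    return [sum(x[j] * pow(root, j * k, q) for j in range(n)) % q for k in range(n)]

def goods_multiply(f, g, k, q, root):
    """f * g in Z_q[x]/(x^(3 * 2^k) - 1); root must have order 2^k in Z_q."""
    n1 = 1 << k
    size = 3 * n1

    def permute(poly):
        rows = [[0] * n1 for _ in range(3)]        # row = power of w, column = power of y
        for i, c in enumerate(poly):
            rows[i % 3][i % n1] = c % q
        return rows

    F = [ntt(r, root, q) for r in permute(f)]      # three parallel size-2^k NTTs
    G = [ntt(r, root, q) for r in permute(g)]
    H = [[0] * n1 for _ in range(3)]
    for s in range(n1):                            # "point" multiplication in each slot:
        for a in range(3):                         # length-3 convolution modulo w^3 - 1
            for b in range(3):
                H[(a + b) % 3][s] = (H[(a + b) % 3][s] + F[a][s] * G[b][s]) % q
    inv_root, inv_n = pow(root, -1, q), pow(n1, -1, q)
    rows = [[v * inv_n % q for v in ntt(r, inv_root, q)] for r in H]
    return [rows[i % 3][i % n1] for i in range(size)]   # undo Good's permutation
```

A toy instance is size 12 with q = 13 and root 5, which has order 4 modulo 13. For the parameters of this section one would call it with k = 9, q = 6984193, and an element of order 512, at the cost of a few seconds in this naive form.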

Products in $\mathbb{Z}_q[x]/(x^{761} - x - 1)$ using Size 1620 NTT
The multiplication of two elements in $\mathbb{Z}_q[x]/(x^{761} - x - 1)$ will result in polynomials of degree at most 1520 if it is performed in $\mathbb{Z}_q[x]$. One can facilitate this multiplication using an incomplete size-1620 NTT in a manner similar to the polynomial multiplication of Kyber v2 [ABD+19] as described in [BKS19]. Instead of applying the NTT to all of the coefficients $f_i$ of a polynomial $f$, effectively six transforms are used on 270 coefficients at a time. The transformed 1620 coefficients are viewed in 270 groups of 6 (or polynomials of degree 5). Polynomial multiplication using this incomplete NTT requires that the point-wise multiplication of these polynomials $\hat{f}^{(k)}$ is performed in a different ring $\mathbb{Z}_q[x]/(x^6 + \psi_{270}^{\mathrm{dr}_{270}(k)})$ for each index $k$. After applying the six inverse transforms on the product, the resulting polynomial can be projected from $\mathbb{Z}_q[x]$ to $\mathbb{Z}_q[x]/(x^{761} - x - 1)$. The choice of an incomplete NTT in Kyber v2 was motivated by a change of the underlying field, which no longer included 512th roots of unity. We implemented this approach because it is more generic, allowing code generation for an NTT without hand-optimized large-radix butterflies.
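The structure of this incomplete transform can be modeled as follows. This is a sketch under our own conventions, not the authors' code: the six size-270 transforms are evaluated naively, the groups are kept in natural order (no digit reversal), and group $k$ is reduced modulo $x^6 - \psi^k$, so the sign and indexing of the small rings differ from the formula above only by a relabeling. All function names are hypothetical.

```python
Q, N, M = 4591, 270, 6                 # 270 * 6 = 1620 >= 2 * 761 - 1

def find_root(n, q, primes):
    """Element of order exactly n in Z_q, given the prime factors of n."""
    for a in range(2, q):
        psi = pow(a, (q - 1) // n, q)
        if all(pow(psi, n // r, q) != 1 for r in primes):
            return psi

PSI = find_root(N, Q, (2, 3, 5))
POW = [pow(PSI, e, Q) for e in range(N)]          # table of psi^e

def transform(col, inverse=False):
    sign = -1 if inverse else 1
    out = [sum(col[j] * POW[(sign * j * k) % N] for j in range(N)) % Q
           for k in range(N)]
    if inverse:
        n_inv = pow(N, -1, Q)
        out = [v * n_inv % Q for v in out]
    return out

def incomplete_ntt(f):
    f = list(f) + [0] * (N * M - len(f))
    cols = [transform(f[j::M]) for j in range(M)]  # six size-270 transforms
    # group k holds f mod (x^6 - psi^k), a polynomial of degree 5
    return [[cols[j][k] for j in range(M)] for k in range(N)]

def mul_mod_binomial(a, b, c):
    """a * b modulo x^M - c."""
    out = [0] * M
    for i in range(M):
        for j in range(M):
            if i + j >= M:
                out[i + j - M] += a[i] * b[j] * c  # x^M is replaced by c
            else:
                out[i + j] += a[i] * b[j]
    return [v % Q for v in out]

def multiply(f, g):
    """Product of f and g in Z_q[x]/(x^1620 - 1)."""
    fh, gh = incomplete_ntt(f), incomplete_ntt(g)
    hh = [mul_mod_binomial(fh[k], gh[k], POW[k]) for k in range(N)]
    h = [0] * (N * M)
    for j in range(M):
        h[j::M] = transform([hh[k][j] for k in range(N)], inverse=True)
    return h

def reduce_ntruprime(h, p=761):
    """Project a product from Z_q[x] to Z_q[x]/(x^p - x - 1)."""
    h = list(h)
    for i in range(len(h) - 1, p - 1, -1):
        h[i - p] += h[i]
        h[i - p + 1] += h[i]
    return [v % Q for v in h[:p]]
```

For inputs of degree at most 760 the result of `multiply` has no wrap-around, so `reduce_ntruprime(multiply(f, g))` gives the product in $\mathbb{Z}_q[x]/(x^{761} - x - 1)$.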

Products in $\mathbb{Z}_q[x]/(x^{761} - x - 1)$ using Size 1530 NTT
One can also implement a size-1530 NTT for $q = 4591$, since $4591 \equiv 1 \pmod{1530}$. Although this would also be a mixed-radix implementation, it would require bigger butterfly operations, e.g., a radix-17 butterfly. The components of a size-1530 NTT multiplication are radix-17, radix-5, radix-3, and radix-2 butterflies. Note that each butterfly has to be performed twice forward and once backward, so there is less and less benefit in performing butterflies in the last several stages compared to just performing multiplications of small-degree polynomials. Hence we decided to perform multiplications of degree-9 polynomials rather than performing radix-2 and radix-5 butterflies. As a result, we chose to use another incomplete NTT for the size-1530 NTT, and perform a radix-17 butterfly followed by two radix-3 butterflies for each of the two input polynomials. Then, the multiplication requires 153 point-wise multiplications of degree-9 polynomials in different rings.

Implementation
In this section, we discuss the implementation of our approaches for polynomial multiplication in NTRU Prime. We integrated our optimizations into the existing state-of-the-art Cortex-M4 implementation of NTRU Prime in the pqm4 project to be able to directly compare our improvements to this implementation. We avoid secret-dependent branches and use Barrett and Montgomery modular reductions to ensure that the running time is independent of secrets.

Two-cycle Barrett reduction.
When implementing Barrett reduction for signed integers, the output of the procedure often has some bias in its sign. For example, for any $q$, 32-bit Barrett reduction of an input $a$ can be implemented as $a - \lfloor a\beta / 2^{32} \rfloor \cdot q$, where $\beta = \lfloor 2^{32}/q \rfloor$. Thus, this algorithm cannot reduce numbers between $q$ and $t = 2^{32}/\beta$. Actually, it can be seen that for $t' = t - q$ the output of the Barrett reduction would be in $(-k \cdot t', q + k \cdot t')$. The factor $k$ is determined by the input size of the reduction. $t$ can be decreased by rounding the result of the division for $\beta$ to the nearest integer. Computing $\beta$ with ceiling changes the sign of the output range, thus the reduction outputs are more likely to be negative numbers. Usually these are the only two options to tune the output of the Barrett reduction when integers are used.
However, the ARM Cortex-M4 architecture has an extension for rounding the high bits of multiplication results, which can be used to reduce the output size. This instruction adds $2^{31}$ to the result of the multiplication of two 32-bit integers and returns the most significant 32 bits of the result. Thus, the output of the Barrett reduction is similarly distributed over positive and negative numbers, i.e., its output range is $(-\frac{q + kt'}{2}, \frac{q + kt'}{2})$. Our two-cycle implementation$^7$ of Barrett reduction can be seen in Algorithm 7.
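The effect of the rounding multiplication can be modeled in a few lines. This sketch is not Algorithm 7 itself; it assumes a multiply that returns the top 32 bits of the product plus $2^{31}$ (modeled by `smmulr`) followed by one multiply-subtract, and it assumes $\beta$ rounded to the nearest integer. The function names are hypothetical.

```python
Q = 4591
BETA = round(2**32 / Q)           # nearest-integer approximation of 2^32 / q

def smmulr(a, b):
    """Model of a rounding high multiply: most significant 32 bits of a*b + 2^31."""
    return (a * b + (1 << 31)) >> 32      # >> on Python ints is an arithmetic shift

def barrett_reduce(a):
    """Return r with r = a (mod q) and |r| < q for any signed 32-bit input a."""
    t = smmulr(a, BETA)           # t is close to round(a / q)
    return a - t * Q              # a single multiply-subtract on the target
```

Because the quotient estimate is rounded to the nearest integer rather than truncated, the remainders come out roughly centered around zero instead of being biased toward one sign.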

32-bit Montgomery multiplier.
While the 32-bit Barrett reduction would be enough for a 16-bit modulus, the output range of the reduction would be bigger than the modulus when it is also 32-bit. Thus, the two-cycle Barrett reduction is not an efficient choice for implementing Good's trick. However, the Cortex-M4 architecture has also extended multiplication instructions for 32-bit full multiplication preferably accompanied with 64-bit addition. Hence, one can implement a three-cycle Montgomery multiplication for 32-bit integers as in Algorithm 8.
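A signed Montgomery multiplication with $R = 2^{32}$ can be modeled as below. This is a sketch of the general technique under our own conventions, not a transcription of Algorithm 8: it uses a 64-bit product, a 32-bit multiplication of the low word by $q^{-1} \bmod 2^{32}$, and a final multiply-subtract whose upper word is the result. Names are hypothetical, and Python 3.8 or later is needed for the modular inverse.

```python
Q_PRIME = 6984193
Q_INV = pow(Q_PRIME, -1, 1 << 32)         # q'^(-1) mod 2^32 (q' is odd)

def to_signed32(v):
    """Interpret the low 32 bits of v as a signed 32-bit integer."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v >= (1 << 31) else v

def montgomery_mul(a, b):
    """Return r with r * 2^32 = a * b (mod q') and |r| < q' if |a*b| < 2^31 * q'."""
    t = a * b                              # full 64-bit product
    m = to_signed32(t * Q_INV)             # low word times q'^(-1), kept signed
    return (t - m * Q_PRIME) >> 32         # t - m*q' is divisible by 2^32

def to_montgomery(w):
    """Precompute w' = 2^32 * w mod q' so that montgomery_mul(w', b) = w*b (mod q')."""
    return (w << 32) % Q_PRIME
```

Twiddle factors are typically stored in the form returned by `to_montgomery`, so that a single `montgomery_mul` yields the plain product with a reduced result.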
$^7$ We can easily substitute $-q^{-1}$ and mla (multiply-add) for $q^{-1}$ and mls (multiply-subtract).
The first split in the polynomial. The CRT map starts with $\mathbb{Z}_q[X]/(X^N - 1)$ and splits it into two small polynomials as $\mathbb{Z}_q[X]/(X^{N/2} - 1) \times \mathbb{Z}_q[X]/(X^{N/2} + 1)$. The operations during the first layer of the NTT can be simply interpreted as reducing the input polynomial modulo $(X^{N/2} - 1)$ and $(X^{N/2} + 1)$. The mixed-radix NTT starts with the original order of the input polynomial and the size of the NTT defined as bigger than twice the degree of the input polynomial. Therefore, we do not need to perform any polynomial reduction during the first layer of the NTT.
Good's trick needs a reordering of the input polynomial to perform three small NTTs. Although the size of the multiplication is bigger than twice the degree of the input polynomial, one needs to consider the inputs of the small NTTs. The reordering process simply distributes the low degree coefficients of the input polynomials to the low degree coefficients of each input of the NTTs. Thus, the first layers of all three NTTs can also be omitted.
Using NTT-based multiplication in NTRU Prime. The most obvious optimization for NTT-based multiplications is to keep polynomials in NTT domain whenever this is possible. Although the secret and public keys can be also kept in NTT domain, our implementation needs at least double-sized arrays to represent polynomials in NTT domain. Therefore, we only used this optimization inside of the low-level operations. NTRU Prime has such a case only in the encryption process: The polynomial b used in bG (line 4) and bA (line 5) in Algorithm 5. Thus, we transform b only once and use the result in NTT domain for the two multiplications.

Floating point registers.
Microcontrollers of the ARM Cortex-M4 family have only 14 available general-purpose registers, which might cause some additional memory operations for register spills during NTT computations. Although our implementation does not make use of floating-point operations, there are 32 single-precision floating-point registers. Those registers can be used to store commonly used variables to avoid memory-load and memory-store operations. Instead, a vmov instruction is used to transfer data to and from a floating-point register, which only takes one cycle in each direction. Another use of those registers is to temporarily store the content of the stack pointer and the link register in order to make all integer registers available for calculations.

Implementation of Good's Trick for Size $1536 = 3 \cdot 2^9$
Conceptually, using Good's trick to multiply means first copying each multiplicand to a temporary array and performing Good's permutation followed by three simultaneous NTTs. Then we do "point multiplication" as the small convolutions modulo $x^3 - 1$. Finally, we do three inverse NTTs, the inverse of Good's permutation, and reductions modulo $q' = 6984193$, $q = 4591$, and then $x^{761} - x - 1$. In detail, the implementation is as follows:
• We first apply Good's permutation combined with the initial three NTT levels: of each pair of entries combined at level 0 in the array, at least one starts the NTT as zero. Therefore, for the NTT at level 0 we only need at most four loads of entries (spaced 192 apart) and some negations. However, negations cost nothing, because, as shown in Algorithm 9 and Algorithm 10, the radix-2 butterfly and the negated radix-2 butterfly cost exactly the same number of cycles.
We trace the signs and the numbers as in Figure 2 to handle negations, and so that for a small input that is multiplied by a root we use mul instead of Montgomery multiplication, which saves two instructions each time.

Input: a, b
Output: −a + b, −a − b
1: rsb a, a, b            a ← b − a
2: sub b, a, b, LSL#1     b ← a − 2b
• We do Cooley-Tukey butterflies, computing $(a, b) \to (a + wb, a - wb)$ by first computing $wb$ via Montgomery multiplication with $w' = 2^{32} w \bmod q$ as in Algorithm 8 and then add-subtract $(a, wb)$. This can be done in place in only two instructions. The convolution modulo $x^3 - 1$ is shown in Algorithm 12 using the preparation of registers for Montgomery multiplication from Algorithm 11.
• The inverse NTT also uses Cooley-Tukey butterflies and proceeds almost exactly as above in three rounds of three levels each, except that the indices are permuted, a different roots table is required, and of course level 0 is nontrivial.

Implementation of Mixed-Radix NTT Multiplication
The size $N$ of a complete NTT has to divide $q - 1 = 4590 = 2 \cdot 3^3 \cdot 5 \cdot 17$ such that an $N$-th root of unity exists in $\mathbb{Z}_q$. Since the implementation should avoid any polynomial reduction, a natural $N$ would be $2 \cdot 3^2 \cdot 5 \cdot 17 = 1530 > 2p$. One visible drawback is the need to implement a radix-17 NTT or butterfly. Another is that every other parameter set would need to be implemented separately, potentially with butterflies of even larger radix and even more complex implementations. Alternatively, a smaller $N$ can be chosen, with fewer or no large butterflies, if an incomplete mixed-radix NTT is implemented.
Choices. We provide two mixed-radix implementations in our work:
1. We implemented size $N = 270 = 2 \cdot 3^3 \cdot 5$ FFTs involving one radix-2 stage, three radix-3 stages, and one radix-5 stage. To use this for multiplication, we need $270k \geq 2p - 1 = 1521$, and the smallest such $k$ is 6. So the length of our incomplete NTT is 1620. After such an incomplete NTT, component-wise multiplication is not in $\mathbb{Z}_q$ but in $\mathbb{Z}_q[X]/(X^6 - \psi_{270}^{i})$.
2. We implemented a length-1530 incomplete mixed-radix NTT.
For a mixed-radix NTT implementation, Cooley-Tukey and Gentleman-Sande butterfly operations can be used, as demonstrated in Figure 3. On the one hand, the Gentleman-Sande butterfly needs to transform all polynomials to the form $(X^d - 1)$ after each CRT layer, i.e., we need to evaluate the polynomial $(X^{N/2} + 1)$ with an $\frac{N}{2}$-th root of $-1$ after the first CRT split. On the other hand, the Cooley-Tukey butterflies need different powers of the $n$-th root of unity to compute each output of the butterfly operations.
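To make the first choice concrete, the sketch below computes the smallest $k$ and reduces a polynomial modulo $X^k - \psi^i$ naively. It is an illustration only: the residues are computed by direct substitution rather than by butterfly stages, and the helper names are hypothetical.

```python
Q, P = 4591, 761

def smallest_k(n, p=P):
    """Smallest k with n*k >= 2p - 1 (ceiling division)."""
    return -(-(2 * p - 1) // n)

def element_of_order(n, q=Q):
    """Brute-force search for an element of exact order n in Z_q^*."""
    for h in range(2, q):
        w = pow(h, (q - 1) // n, q)
        if all(pow(w, d, q) != 1 for d in range(1, n)):
            return w
    raise ValueError("n must divide q - 1")

def reduce_mod(f, k, c, q=Q):
    """Reduce f modulo X^k - c by substituting X^k -> c."""
    r = [0] * k
    for idx, coeff in enumerate(f):
        r[idx % k] = (r[idx % k] + coeff * pow(c, idx // k, q)) % q
    return r

def poly_mul(a, b, q=Q):
    """Schoolbook product of two coefficient lists, without reduction in X."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % q
    return out

def incomplete_ntt(f, n, k, psi, q=Q):
    """All n residues of f modulo X^k - psi^i, computed naively."""
    return [reduce_mod(f, k, pow(psi, i, q), q) for i in range(n)]
```

Because every $\psi_{270}^{i}$ satisfies $(\psi_{270}^{i})^{270} = 1$, each $X^6 - \psi_{270}^{i}$ divides $X^{1620} - 1$, and multiplying the length-6 residues pairwise gives the residues of the full product.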
The Cooley-Tukey butterfly can be optimized with the observation that $\psi_{3n}^{n} = \psi_3$. Thus, multiplication with $\psi_n^{j}$ and $\psi_n^{2j}$ can be moved to the beginning of the butterfly computations to have the same type of multiplication as the Gentleman-Sande butterfly. However, we would still need to perform modular reduction for the first output more often than with the Gentleman-Sande butterfly. Hence, we decided to implement Gentleman-Sande butterflies to optimize register usage and performance of each butterfly operation for the incomplete mixed-radix NTT Option 1, i.e., the implementation that comprises only small radixes. But the radix-17 implementation requires a sum of 17 variables for the first output, so the Gentleman-Sande butterfly also requires modular reduction for all of its outputs. Therefore, we decided to implement Cooley-Tukey butterflies to combine all multiplications with Rader's trick for the incomplete mixed-radix Option 2.
The ARM Cortex-M4 architecture has special instructions (smlad(x), smuad(x), smlsd(x), smusd(x)) that can perform two 16-bit signed multiplications plus one or two 32-bit additions/subtractions in one cycle. Thus, the Cooley-Tukey butterfly can compute each output in one cycle for radix-3, as the Gentleman-Sande butterfly does, before the modular reductions. In our implementation Option 1, in addition to radix-3 butterflies (see Figure 3), we also needed radix-2 and radix-5 butterflies as shown in Figure 4. In the implementation Option 2, on the other hand, we need the Cooley-Tukey version of radix-3 from Figure 3 together with the radix-17 implementation described in Appendix B of the extended version of this paper [ACC+20].
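The difference between the two butterfly types can be sketched for radix 3 as follows. This is an illustrative model in plain modular arithmetic, not the packed 16-bit assembly described above; the function names and the use of a 9th root of unity as twiddle are assumptions made for the example.

```python
Q = 4591

def element_of_order(n, q=Q):
    """Brute-force search for an element of exact order n in Z_q^*."""
    for h in range(2, q):
        w = pow(h, (q - 1) // n, q)
        if all(pow(w, d, q) != 1 for d in range(1, n)):
            return w
    raise ValueError("n must divide q - 1")

PSI9 = element_of_order(9)   # primitive 9th root of unity
OMEGA = pow(PSI9, 3, Q)      # psi_9^3 = psi_3, a primitive cube root of unity

def dft3(a, b, c):
    """Twiddle-free size-3 transform: evaluate a + bX + cX^2 at 1, w, w^2."""
    w, w2 = OMEGA, OMEGA * OMEGA % Q
    return ((a + b + c) % Q,
            (a + w * b + w2 * c) % Q,
            (a + w2 * b + w * c) % Q)

def ct_radix3(a, b, c, t):
    """Cooley-Tukey style: multiply inputs by t and t^2, then apply dft3."""
    return dft3(a, t * b % Q, t * t % Q * c % Q)

def gs_radix3(a, b, c, t):
    """Gentleman-Sande style: apply dft3, then multiply outputs by t and t^2."""
    x0, x1, x2 = dft3(a, b, c)
    return (x0, t * x1 % Q, t * t % Q * x2 % Q)
```

With twiddle $t$, `ct_radix3` evaluates $a + bX + cX^2$ at $t$, $t\omega$, and $t\omega^2$, while `gs_radix3` evaluates at $1$, $\omega$, $\omega^2$ and then twists the outputs.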
Implementation of radix-17 butterfly. The radix-17 butterfly can be seen as a size-17 NTT. Thus, Rader's trick can be applied to transform it into a polynomial multiplication in $\mathbb{Z}_q[X]/(X^{16} - 1)$. Since $2^{-1}$ exists modulo $q$, CRT can be used to split the ring as $\mathbb{Z}_q[X]/(X^8 - 1) \times \mathbb{Z}_q[X]/(X^8 + 1)$. One can also use CRT for $\mathbb{Z}_q[X]/(X^8 - 1) \cong \mathbb{Z}_q[X]/(X^4 - 1) \times \mathbb{Z}_q[X]/(X^4 + 1)$ to reduce the size of the multiplication even further. Note that using CRT for $\mathbb{Z}_q[X]/(X^8 + 1)$ requires $\sqrt{-1}$ modulo $q$, which does not exist for $q = 4591$. After the above CRT map, we perform two 4-by-4 polynomial multiplications and one 8-by-8 polynomial multiplication to apply Rader's trick. For a more detailed description of the radix-17 butterfly, see [ACC+20, Appendix B].
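Rader's reindexing for size 17 can be sketched as below. The sketch computes the length-16 cyclic convolution by schoolbook sums and omits the CRT splitting described above; the generator 3 of the multiplicative group modulo 17 and the function names are choices made for this example, not taken from the implementation.

```python
Q = 4591
G = 3  # generator of the multiplicative group modulo 17 (3^8 = -1 mod 17)

def root_of_unity_17(q=Q):
    """Any h^((q-1)/17) different from 1 has exact order 17, as 17 is prime."""
    for h in range(2, q):
        w = pow(h, (q - 1) // 17, q)
        if w != 1:
            return w
    raise ValueError("17 must divide q - 1")

def naive_dft17(x, w, q=Q):
    """Reference transform: X[k] = sum_n x[n] * w^(n*k)."""
    return [sum(x[n] * pow(w, n * k, q) for n in range(17)) % q
            for k in range(17)]

def rader_dft17(x, w, q=Q):
    """Size-17 transform via a length-16 cyclic convolution."""
    a = [x[pow(G, m, 17)] for m in range(16)]               # a_m = x[g^m]
    b = [pow(w, pow(G, (16 - m) % 16, 17), q)               # b_m = w^(g^-m)
         for m in range(16)]
    conv = [sum(a[m] * b[(l - m) % 16] for m in range(16)) % q
            for l in range(16)]
    out = [0] * 17
    out[0] = sum(x) % q
    for l in range(16):
        k = pow(G, (16 - l) % 16, 17)                       # k = g^-l
        out[k] = (x[0] + conv[l]) % q
    return out

def minus_one_is_square(q=Q):
    """Euler's criterion for -1 modulo the prime q."""
    return pow(q - 1, (q - 1) // 2, q) == 1
```

Here `minus_one_is_square()` returns `False` for $q = 4591$, which matches the remark that $X^8 + 1$ cannot be split further by CRT.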
Base multiplication for degree-5 polynomials. The final component-wise multiplication becomes a multiplication in $\mathbb{Z}_q[X]/(X^6 - \psi_{270}^{i})$. We implemented this using an $O(n^2)$ schoolbook multiplication routine. Similar to the radix-5 butterfly operation, multiplications for even and odd indices of the output are combined. The odd-indexed output coefficients require an even number of multiplications with $\psi_{270}^{i}$ and an even number without, so the products can be packed in pairs to use the smladx instruction. We therefore compute the odd-indexed coefficients of $a \cdot b$, where $a = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5$ and $b = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4 + b_5 x^5$. We then obtain the even-indexed coefficients by transforming $a$ to $a' = x \cdot a = a_5 \psi_{270}^{i} + a_0 x + a_1 x^2 + a_2 x^3 + a_3 x^4 + a_4 x^5$ and computing the odd-indexed coefficients of $a' \cdot b$.
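The odd/even index trick can be checked against a plain schoolbook product. The following is a minimal sketch in ordinary modular arithmetic, without the 16-bit packing; the function names and the constant used for $\psi_{270}^{i}$ in the usage note are hypothetical.

```python
Q = 4591

def schoolbook_mod(a, b, c, q=Q):
    """Reference product in Z_q[X]/(X^6 - c)."""
    out = [0] * 6
    for i in range(6):
        for j in range(6):
            term = a[i] * b[j]
            if i + j >= 6:
                term *= c          # X^6 wraps around to c
            out[(i + j) % 6] = (out[(i + j) % 6] + term) % q
    return out

def odd_coeffs(a, b, c, q=Q):
    """Coefficients 1, 3, 5 of a*b modulo X^6 - c."""
    res = []
    for k in (1, 3, 5):
        low = sum(a[i] * b[k - i] for i in range(k + 1))          # i + j = k
        high = sum(a[i] * b[k + 6 - i] for i in range(k + 1, 6))  # i + j = k + 6
        res.append((low + c * high) % q)
    return res

def basemul(a, b, c, q=Q):
    """Full product using only the odd-coefficient routine."""
    odd = odd_coeffs(a, b, c, q)
    a_shift = [a[5] * c % q] + list(a[:5])   # a' = X * a mod (X^6 - c)
    even = odd_coeffs(a_shift, b, c, q)      # odd coeffs of a'*b = even coeffs of a*b
    return [even[0], odd[0], even[1], odd[1], even[2], odd[2]]
```

For example, `basemul([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1], 7)` agrees with `schoolbook_mod` on the same inputs, where 7 merely stands in for some power $\psi_{270}^{i}$.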

Evaluation
The pqm4 framework provides an infrastructure for measuring the execution time of cryptographic primitives on a Cortex-M4 microprocessor. The framework measures the number of cycles required for key generation, key encapsulation, and key decapsulation. Furthermore, the framework also provides an infrastructure to measure the stack memory used by different implementations.
We compare our results with implementations provided by the pqm4 project. The current state-of-the-art implementation of NTRU Prime for Cortex-M4 by Yang et al. (referred to as "Toom-Cook") uses Toom-Cook multiplication and has been part of the pqm4 library since April 2020, as mentioned in Section 1.
The cycle counts of the different optimized implementations of polynomial multiplication are shown in Table 5. When compared to the Toom-Cook implementation, our implementation using Good's trick and the first mixed-radix implementation are 30% faster. The second and more generic mixed-radix implementation is 17% faster. All implementations provide a special NTT version for polynomials with coefficients in $\{-1, 0, 1\}$. Thus, the cycle counts provided for the NTT are an average of an NTT with input coefficients modulo $q$ and an NTT with small inputs whose non-zero coefficients can only be $\pm 1$.
by Bernstein and Yang in [BY19], the key generation still is dominated by the time spent on the generation of polynomials instead of arithmetic operations on them. As a result, our implementations show less than 1% speed-up during key generation (G) of sntrup761, while they show improvements for encapsulation (E) and decapsulation (D) of E: 10% and D: 22% using Good's trick or the mixed-radix implementation of the size-1530 NTT, and of E: 5% and D: 10% using the size-1620 mixed-radix version.
The ntrulpr761 key generation requires no polynomial inversion. Thus, we are able to see the effect of our implementations better in this scheme. Our versions using Good's trick and mixed radix (1) have G: 10%, E: 15%, and D: 20% speed improvements in key generation, encapsulation, and decapsulation respectively compared to the Toom-Cook based implementation. Furthermore, the mixed radix (2) implementation has a speed-up of G: 5%, E: 10%, and D: 11% for the same operations.
Since, except for polynomial multiplication, we mostly used existing code from the Toom-Cook implementation, the differences in the cycle counts of the key generation in ntrulpr761 are exactly the difference of the polynomial multiplication between our implementations and the Toom-Cook version. The encapsulation primitive in ntrulpr761 requires two multiplications with a common multiplier and thus the difference is as big as the difference of two polynomial multiplications plus the time spent for one NTT. Because decapsulation uses encapsulation as a part of the Fujisaki-Okamoto transform, the difference can be calculated in a similar fashion.

Conclusion
In this paper, we present three efficient and constant-time implementations of the two NTRU Prime schemes Streamlined NTRU Prime and NTRU LPRime. Considering the parameter sets of NTRU Prime with $p = 761$ and $q = 4591$, our implementation using Good's trick has slightly better overall performance than the mixed-radix version but requires 32-bit multipliers and noticeably more memory, which might be an issue on constrained devices or on platforms such as the Cortex-M3 that have no constant-time 32-bit multiplier. The mixed-radix version has very close performance, requires less memory, and can be implemented with smaller multipliers.
Another difference between our three implementation approaches is their applicability. Our fast mixed-radix implementation is mostly parameter-set specific and requires a new design for other parameters of NTRU Prime. The version using Good's trick can be used for more than one parameter set of NTRU Prime when the selected NTT modulus $q'$ and size $N_1$ cover the full multiplication for the target polynomial. For example, NTRU Prime has another parameter set with $p = 653$ and $q = 4621$. Using a similar calculation as described in Section 3.1 with $q' = 6984193$, which is larger than $4 \cdot 2310 \cdot 653 = 6033720$, and $N_1 = 1536$, which is larger than $2 \cdot 653$, one can see that almost the same implementation can be used for this parameter set by only changing the last polynomial reduction. However, the mixed-radix version for the same polynomial with $q - 1 = 4620 = 2^2 \cdot 3 \cdot 5 \cdot 7 \cdot 11$ will require more adjustments since $q$ is different. Although Good's trick has more flexibility, the third parameter set of NTRU Prime needs different choices for $q'$ and $N_1$, and therefore a new implementation is required to apply this approach.
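The two size conditions quoted for $p = 653$ can be checked mechanically. This sketch only re-evaluates the inequalities exactly as stated in the paragraph above; the function name is hypothetical, the factor 4 is passed explicitly because it is simply the constant appearing in that inequality, and the derivation of the bound in Section 3.1 is not reproduced here.

```python
def good_trick_fits(p, q, q_ntt, n1, bound_factor):
    """Check q_ntt > bound_factor * ((q - 1) / 2) * p and n1 > 2p."""
    coeff_bound = bound_factor * ((q - 1) // 2) * p
    return q_ntt > coeff_bound and n1 > 2 * p

Q_NTT = 6984193   # NTT modulus quoted in the text
N1 = 1536         # transform length quoted in the text

# Parameter set with p = 653, q = 4621, as discussed in the text.
fits_653 = good_trick_fits(653, 4621, Q_NTT, N1, bound_factor=4)

# Assumption: the third parameter set is taken to be p = 857, q = 5167.
# With these values already the length condition 1536 > 2 * 857 fails.
fits_857 = good_trick_fits(857, 5167, Q_NTT, N1, bound_factor=4)
```

Under these assumptions `fits_653` is `True` and `fits_857` is `False`, in line with the statement that the third parameter set needs different choices.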
As a result, we recommend using Good's trick for larger systems, e.g., CPUs with vector extensions, or for supporting more than one parameter set to reduce engineering effort. Furthermore, we recommend using the mixed-radix version for smaller microcontrollers, FPGA implementations, and hardware accelerators, where only small multipliers are available, and for finite-field instruction-set extensions using small multipliers as discussed in [AEL+20]. Both approaches presented in this paper are suitable for other NTRU Prime parameter sets. Since Good's trick would also work for mixed-radix NTT, it would be interesting to combine the two techniques in the mixed-radix approach.