Multi-Parameter Support with NTTs for NTRU and NTRU Prime on Cortex-M4

We propose NTT implementations with each supporting at least one parameter of NTRU and one parameter of NTRU Prime. Our implementations are based on size-1440, size-1536, and size-1728 convolutions without algebraic assumptions on the target polynomial rings. We also propose several improvements for the NTT computation. Firstly, we introduce dedicated radix-(2, 3) butterflies combining Good-Thomas FFT and vector-radix FFT. In general, there are six dedicated radix-(2, 3) butterflies, and together they support implicit permutations. Secondly, for odd prime radices, we show that the multiplications for one output can be replaced with additions/subtractions. We demonstrate the idea for radix-3 and show how to extend it to any odd prime. Our improvement also applies to radix-(2, 3) butterflies. Thirdly, we implement an incomplete version of Good-Thomas FFT for addressing potential code size issues. For NTRU, our polynomial multiplications outperform the state-of-the-art by 2.8%−10.3%. For NTRU Prime, our polynomial multiplications are slower than the state-of-the-art. However, the state-of-the-art exploits the specific structure of the coefficient rings or polynomial moduli, while our NTT-based multiplications exploit neither and apply across different schemes. This reduces the engineering effort, including testing and verification.


Introduction
Shor's algorithm for integer factorization and discrete logarithm threatens public-key cryptosystems based on RSA and ECC [Sho97]. Since then, researchers have been developing cryptosystems without known weaknesses from quantum computers. This line of research is known as "post-quantum cryptography". In 2016, the National Institute of Standards and Technology (NIST) called for proposals replacing existing standards for public-key cryptosystems with schemes resisting attacks by quantum computers.
Recent research has shown that the number-theoretic transform (NTT) plays an important role in the implementations of lattice-based submissions, including Dilithium [BHK + 22], Kyber [BKS19, AHKS22, BHK + 22], NTRU [CHK + 21], NTRU Prime [ACC + 21], and Saber [CHK + 21, ACC + 22, BHK + 22, BMK + 22]. If an NTT is natively supported, then we can apply it directly. On the other hand, if an NTT is not natively supported, we have to choose a large NTT-friendly polynomial ring covering the maximum value of the result in Z[x]. This approach is known as "NTT multiplications for NTT-unfriendly rings" by [CHK + 21, FSS20]. On Cortex-M4, the most compelling non-NTT-based approach is the Toeplitz matrix-vector product (TMVP) exploiting the structure of (weighted) convolutions without changing the coefficient rings [IKPC20, IKPC22]. [IKPC20] demonstrated the idea for Saber on the ARM Cortex-M4. Their implementation remains the fastest non-NTT-based approach on Cortex-M4 compared to [BMK + 22]. Recently, [IKPC22] showed that TMVP is faster than the NTT-based multiplication by [CHK + 21] for NTRU on Cortex-M4.
We propose improvements for NTT-based multiplications balancing between performance and code size without assuming any algebraic properties of the target coefficient rings. Our NTT-based multiplications outperform [IKPC22] for the parameters ntruhps2048677, ntruhrss701, and ntruhps4096821. Since no algebraic properties are assumed, our implementations naturally extend to NTRU Prime, while TMVP may not apply to NTRU Prime, with its polynomial modulus x^p − x − 1, without doubling the size of the convolutions. Cortex-M4 implementations targeting the board STM32F407-DISCOVERY [KRS19, BKS19, IKPC20, MKV20, ABCG20, ACC + 21, CHK + 21, GKS21, AHKS22, ACC + 22, IKPC22] reported performance numbers while clocking at 24 MHz to avoid wait states. A more meaningful way of benchmarking is to report the numbers at the full speed, 168 MHz, of the board. This illustrates to users the impact on performance when adjusting the frequency for their development. We take this into consideration and report at both 24 MHz and 168 MHz. Our implementations are designed with compact code size and negligible performance penalties when raising the frequency.
Contribution. Our contribution is summarized as follows.
1. We implement NTT-based convolutions, each supporting multiple parameters.
2. We introduce dedicated radix-(2, 3) butterflies combining Good-Thomas FFT and vector-radix FFT, supporting implicit permutations.
3. We point out an overlooked optimization for all non-radix-2 butterflies. In particular, for a radix-r butterfly with r > 2, we replace (r − 1) multiplications and 1 addition with (r − 2) additions and 1 subtraction. This extends the existence of subtraction in radix-2 butterflies to all butterflies. Thus, it is applicable to other platforms and other implementations computing non-radix-2 butterflies. We further apply this optimization to our radix-(2, 3) butterflies implementing vector-radix FFT.
4. We reduce code size while permuting coefficients implicitly for Good-Thomas FFT.
To enable this, we formally present the original Good-Thomas FFT [Goo58] as an isomorphism from an associative algebra to a tensor product of associative algebras, which justifies the existence of incomplete versions of Good-Thomas FFT different from [Ber01] and [ACC + 21]^1. To demonstrate the practical code size advantage, we benchmark our NTT-based polynomial multiplications at the full speed of our board.
Dedicated radix-(2, 3) butterflies and improved non-radix-2 butterflies are algorithmic improvements applicable to other platforms. While our code size optimization for Good-Thomas FFT is specific to our board STM32F407-DISCOVERY, our demonstrated approach for addressing code size issues has other benefits, e.g., the potential to facilitate vectorization.
We discuss two use cases where multiplications based on our NTT-based convolutions are more favorable: (i) plural schemes with comparable parameter sets are selected by NIST and other institutions (cf. OpenSSH), and (ii) multiple parameter sets of one scheme are being implemented (here, for the M4). For (i), the state-of-the-art multipliers for NTRU [IKPC22] and NTRU Prime [Che21] exploit the special structures of the polynomial moduli or coefficient rings, so adapting the code of NTRU Prime and NTRU to each other is hard. Since we compute the results in Z[x], the only distinctions are the reductions to the target polynomial rings. For (ii), each of the multipliers for NTRU Prime by [Che21] only supports one parameter. The multipliers by [IKPC22] (without doubling polynomial degrees) support NTRU parameters up to a fixed degree, and our NTT-based convolutions support NTRU and NTRU Prime parameters up to a fixed degree. Notice that our NTT-based multipliers with compact code sizes are faster than the unrolled multipliers by [IKPC22].

Related work.
The incomplete transformation of Good-Thomas FFT can already be deduced from [Goo58]. [FP07] was aware of the use of incomplete Good-Thomas FFT for vectorization, but it is unclear whether their program Spiral picked incomplete Good-Thomas FFT as the best code-generation strategy. [ACC + 21] and [Che21] implemented polynomial multiplications for NTRU Prime on Cortex-M4. [ACC + 21] also explained how to permute implicitly for Good-Thomas FFT with their proposed dedicated 3-layer radix-2 butterflies. [IKPC20] proposed the use of TMVP for Saber on Cortex-M4. Shortly after, [CHK + 21] implemented NTT-based polynomial multiplications for LAC, NTRU, and Saber on Cortex-M4 and on Skylake with AVX2. Finally, [IKPC22] applied TMVP to NTRU on Cortex-M4.
Structure of this paper. This paper is structured as follows: Section 2 gives the background. Section 3 introduces our NTT improvements. Section 4 describes our implementations of NTT-based multiplications. Section 5 gives performance numbers.

Preliminaries
Section 2.1 introduces the target polynomial multiplications in NTRU and NTRU Prime. Section 2.2 introduces the Cortex-M4 and the implementations of modular reductions and multiplications. Section 2.3 explains NTTs. We then review various kinds of FFTs, including Cooley-Tukey (Section 2.4), Good-Thomas (Section 2.5), and vector-radix FFT (Section 2.6). Finally, Section 2.7 explains how to apply NTTs to NTT-unfriendly rings.

Polynomial Multiplications in NTRU and NTRU Prime
The NTRU submission [CDH + 20] comprises parameter sets for two similar schemes, NTRU-HPS and NTRU-HRSS, both operating on the polynomial rings Z_3[x]/(Φ_n), Z_q[x]/(Φ_n), and Z_q[x]/(x^n − 1). Here q is a power of 2, n is a prime, and Φ_n = (x^n − 1)/(x − 1) = 1 + x + ⋯ + x^{n−1}. The NTRU Prime submission [BBC + 20] consists of two families of schemes, NTRU LPRime and Streamlined NTRU Prime. Both operate in the polynomial rings Z_3[x]/(x^p − x − 1) and Z_q[x]/(x^p − x − 1) for various primes p and q such that Z_q[x]/(x^p − x − 1) is a finite field. We focus on polynomial multiplications in the rings Z_q[x]/(x^n − 1) of NTRU and Z_q[x]/(x^p − x − 1) of NTRU Prime, each with one operand ternary (coefficients in {−1, 0, 1}). See the parameters in Tables 1-2.
While NTRU-HRSS requires no sampling of polynomials with fixed numbers of {−1, 0, 1} coefficients, NTRU-HPS, NTRU LPRime, and Streamlined NTRU Prime call a sorting network subroutine for the sampling. Furthermore, polynomial inversions are required in the key generations of NTRU-HPS, NTRU-HRSS, and Streamlined NTRU Prime. We refer to the specifications [CDH + 20, BBC + 20] for more details.

Cortex-M4
As selected by NIST for evaluating PQC candidates on microcontrollers, the ARM Cortex-M4 is our target platform for implementing PQC schemes. The Cortex-M4 implements the Armv7E-M architecture. Some of the most relevant features are as follows. General-purpose registers: there are 16 core registers, named r0-r15. Except for the stack pointer (r13) and the program counter (r15), all core registers can be treated as general-purpose registers.
Barrel shifters: shifts and rotates (asr, lsl, lsr, and ror) come at no extra cost when used as the "flexible second operand" of a standard data-processing instruction.
We first describe the multiplication instructions. mul multiplies two 32-bit values and places the lower 32 bits of the result in the first (destination) register. mla additionally accumulates the 32-bit result into an accumulator, and mls subtracts the 32-bit result from the accumulator. The accumulators are the last arguments named to mla and mls. smull multiplies two 32-bit values and places the 64-bit result in two destination registers: the first-named register holds the lower 32 bits of the result, and the second-named register holds the upper 32 bits. smlal accumulates the 64-bit result into the destination registers. umull and umlal are their unsigned counterparts. smmul returns the upper 32 bits of the 64-bit product of two 32-bit values. The suffix r indicates that the 64-bit product is first rounded to the upper 32 bits.
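As a concrete reference for these semantics, the following Python model (our own sketch, not code from the paper) mimics the listed instructions on 32-bit two's-complement values:

```python
# Python model of the Cortex-M4 multiply instructions described above.
MASK32 = (1 << 32) - 1

def s32(x):
    """Interpret the low 32 bits of x as a signed value."""
    x &= MASK32
    return x - (1 << 32) if x >> 31 else x

def mul(a, b):      return s32(a * b)        # low 32 bits of the product
def mla(a, b, acc): return s32(a * b + acc)  # multiply-accumulate
def mls(a, b, acc): return s32(acc - a * b)  # multiply-subtract

def smull(a, b):
    p = s32(a) * s32(b)                      # full signed 64-bit product
    return p & MASK32, p >> 32               # (lower word, upper word)

def smmul(a, b):  return (s32(a) * s32(b)) >> 32              # upper 32 bits
def smmulr(a, b): return (s32(a) * s32(b) + (1 << 31)) >> 32  # rounded variant
```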
Modular reductions and multiplications are the most critical parts of NTT-based polynomial multiplications. We implement 32-bit Barrett reduction [Bar86], 32-bit Montgomery reduction, and 32-bit Montgomery multiplication [Mon85, Sei18]. Throughout this paper, we assume R = 2^32 and let mod± be the signed modular reduction. 32-bit Barrett reduction maps a value a to a − round(a · round(R/q)/R) · q as an approximation of a − round(a/q) · q = a mod± q; Algorithm 1 is an illustration. Montgomery multiplication, denoted montgomery(a, b), computes (ab − ((ab q^{−1}) mod± R) q)/R ≡ abR^{−1} (mod q). To see why it is a reduction, we observe that in terms of absolute values, |montgomery(a, b)| ≤ |ab|/R + q/2. Concretely, we denote by mMul_des_32(l, h, a, b, t) the 32-bit Montgomery multiplication with input registers a, b and output register h, as shown in Algorithm 2.
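The two reductions can be sketched in Python on arbitrary-precision integers; the modulus q = 7681 below is our example choice, not a parameter from the paper:

```python
# Sketch of 32-bit signed Barrett reduction and Montgomery multiplication.
R = 2**32
Q = 7681  # example NTT-friendly prime (our choice)

def smod(a, m):
    """Signed (centered) reduction: result in (-m/2, m/2]."""
    r = a % m
    return r - m if r > m // 2 else r

V = (R + Q // 2) // Q          # precomputed Barrett constant round(R/Q)

def barrett(a):
    # a - round(a*V / R) * Q  ~  a - round(a/Q) * Q = a mod± Q
    t = (a * V + R // 2) >> 32  # models smmulr: rounded upper 32 bits
    return a - t * Q

QINV = smod(pow(Q, -1, R), R)  # Q^-1 mod± R (Q is odd, hence invertible)

def montgomery(a, b):
    # (ab - ((ab * Q^-1) mod± R) * Q) / R  ≡  a*b*R^-1 (mod Q)
    ab = a * b
    t = smod(ab * QINV, R)
    return (ab - t * Q) >> 32   # exact division: ab - t*Q is a multiple of R
```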

Number-Theoretic Transforms
In this paper, we assume readers are already familiar with the language of tensor products of associative algebras over the same commutative ring. Nevertheless, we go through some important concepts in algebra. A homomorphism from an algebraic structure to another is a structure-preserving map. We call it an isomorphism if there is a one-to-one correspondence between the domain and the range. The importance of isomorphisms lies in the underlying computational costs: they convert expensive computations into cheap computations without losing any algebraic properties. Let R be a commutative ring and f(x) a polynomial with coefficients in R. The polynomial ring R[x]/(f(x)) is an associative algebra over R. If f(x) takes the form x^n − 1 for some n ∈ N, R[x]/(x^n − 1) is a group algebra since it can be constructed naturally by taking the elements of the cyclic group (Z_n, +, 0) as the basis. Polynomial multiplications in R[x]/(x^n − 1) are cyclic convolutions. The tensor product of two associative algebras over R is also an associative algebra over R. We also write R[x_0]/(f_0(x_0)) ⊗ R[x_1]/(f_1(x_1)) as R[x_0, x_1]/(f_0(x_0), f_1(x_1)). We refer to [Jac12, Sec. 3.9] and [Bou89, Chap. III, Sec. 4.1] for a more formal treatment. At a high level, our implementations convert the multiplication in R[x]/(x^n − 1) into a multiplication in a tensor product of associative algebras.
We proceed with the definitions of the number-theoretic transforms (NTTs). If ζ^m is invertible in R, m is coprime to the characteristic of R, and there is a principal m-th root of unity ω_m, we can define the size-m NTT as the isomorphism splitting x^n − ζ^m into the product of the factors x^{n/m} − ζω_m^i for i = 0, …, m − 1. For R = Z_q with q prime, these conditions amount to m | q − 1.
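A minimal numerical sketch of such a splitting, with toy parameters of our choosing (q = 17 and a size-2 NTT splitting Z_q[x]/(x^4 − 1) into two quadratic rings, i.e., an incomplete NTT), is:

```python
# Size-2 NTT on Z_17[x]/(x^4 - 1): split into x^2 - 1 and x^2 + 1,
# multiply with degree-1 basemuls, then merge back via the CRT.
Q = 17

def split(a):              # a = a0 + a1*x + a2*x^2 + a3*x^3
    lo = [(a[0] + a[2]) % Q, (a[1] + a[3]) % Q]  # a mod (x^2 - 1)
    hi = [(a[0] - a[2]) % Q, (a[1] - a[3]) % Q]  # a mod (x^2 + 1)
    return lo, hi

def basemul(a, b, zeta):   # (a0 + a1*x)(b0 + b1*x) mod (x^2 - zeta)
    return [(a[0]*b[0] + zeta*a[1]*b[1]) % Q, (a[0]*b[1] + a[1]*b[0]) % Q]

def merge(lo, hi):         # inverse of split (the CRT), dividing by 2
    inv2 = pow(2, -1, Q)
    return [(lo[0] + hi[0]) * inv2 % Q, (lo[1] + hi[1]) * inv2 % Q,
            (lo[0] - hi[0]) * inv2 % Q, (lo[1] - hi[1]) * inv2 % Q]
```

Multiplying the two halves with zeta = 1 and zeta = −1 and merging reproduces the product modulo x^4 − 1.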

Cooley-Tukey and Gentleman-Sande FFTs
Cooley-Tukey [CT65] and Gentleman-Sande [GS66] FFTs are popular approaches for computing size-m NTTs for highly composite m. For a factorization m = m_0 m_1, Cooley-Tukey FFT splits the size-m NTT into size-m_0 NTTs with ω_{m_0} := ω_m^{m_1} and size-m_1 NTTs with ω_{m_1} := ω_m^{m_0}, up to a permutation. If m = r^k for some r and k, we can apply a k-level split where each level consists of several size-r NTTs. It can easily be shown that there is a radix-r reversal between the result of the CT FFT and the straightforward computation. Note that the first level maps a polynomial a(x) to the tuple (a(ζ^{m_1} ω_{m_0}^{i_0}))_{i_0} by evaluating x^{n/m_0} at several ζ^{m_1} ω_{m_0}^{i_0}'s. Such evaluations are also called Cooley-Tukey butterflies. Figure 1a is an illustration for (a')_0 (up) and (a')_1 (down) where m_0 = 2 (and ω_2 = −1), and Figure 2a is an illustration for (a')_0 (up), (a')_1 (middle), and (a')_2 (down) where m_0 = 3.
Rewriting each factor in terms of a new variable y requires multiplications by powers of ω_m^{i_0}. It is usually written as the map x^{n/m_0} → ω_m^{i_0} y and is also called "twisting". Figures 1b and 2b are illustrations for m_0 = 2 and m_0 = 3. In this paper, we retain the language of equivalence relations for clear explanations of the optimizations, but one must keep in mind that we still need multiplications.
If there are no r, d_0, d_1 ∈ N satisfying m_0 = r^{d_0} and m_1 = r^{d_1}, we call the FFT "mixed-radix".
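The radix-2 split described above can be sketched recursively; the modulus, transform size, and root search below are our example choices, not values from the paper:

```python
# Recursive radix-2 Cooley-Tukey NTT over Z_Q, output in natural order.
Q = 7681   # 7681 = 15 * 2**9 + 1, so size-16 NTTs exist modulo Q
N = 16

def primitive_root(q):
    # prime factors of q - 1 = 7680 are 2, 3, and 5
    return next(g for g in range(2, q)
                if all(pow(g, (q - 1) // f, q) != 1 for f in (2, 3, 5)))

W = pow(primitive_root(Q), (Q - 1) // N, Q)  # principal N-th root of unity

def ntt(a, w, q):
    """Return (a(w^0), a(w^1), ...) using CT butterflies."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], w * w % q, q)   # e(y) where a(x) = e(x^2) + x*o(x^2)
    odd = ntt(a[1::2], w * w % q, q)    # o(y)
    out = [0] * n
    for i in range(n // 2):
        t = odd[i] * pow(w, i, q) % q   # twiddle w^i
        out[i] = (even[i] + t) % q
        out[i + n // 2] = (even[i] - t) % q
    return out
```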

Good-Thomas FFT
Let q_0 and q_1 be two coprime integers. Good-Thomas FFT transforms the size-q_0 q_1 cyclic NTT over R[x]/(x^{q_0 q_1} − 1) into a 2-dimensional cyclic NTT of dimensions q_0 × q_1 via the index map i ↦ (i mod q_0, i mod q_1); here e_0 and e_1 are orthogonal idempotent elements (e_{i_0} e_{i_1} = δ_{i_0,i_1} e_{i_0}, where δ is the Kronecker delta) realizing the inverse map a ≡ e_0 (a mod q_0) + e_1 (a mod q_1) (mod q_0 q_1) [Goo58, Section 12][Tho63]. Since cyclic NTTs require fewer multiplications than acyclic ones, this is more favorable than the mixed-radix size-q_0 q_1 CT and GS FFTs. [Ber01] and [ACC + 21] stated the transformation as turning a group algebra into a tensor product of group algebras by introducing x ∼ x^(0) x^(1). However, this statement is actually weaker than the originally proposed formulation by [Goo58]. We describe a statement for convolutions implied by the work of [FP07, Paragraph Vectorized FFT, Section 3].
Let n = vq_0 q_1 with q_0 ⊥ q_1. Applying Good-Thomas FFT to the convolution is to introduce the equivalence x^v ∼ x^(0) x^(1) with (x^(0))^{q_0} = (x^(1))^{q_1} = 1. [FP07] was aware of the transformation, but it is unclear whether their program Spiral generates it. Moreover, they overlooked two important implementation aspects of multi-dimensional NTTs: (i) vector-radix FFT, covered in the next section, and (ii) the code size while permuting with Good-Thomas FFT, as shown in Section 3.2.
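The index permutation underlying Good-Thomas FFT can be checked numerically; the toy sizes q_0 = 2, q_1 = 3 and the idempotents (e_0, e_1) = (3, 4) below are our own illustration:

```python
# Good-Thomas: a size-6 cyclic convolution becomes a 2 x 3 cyclic convolution
# after permuting indices via the CRT map i -> (i mod 2, i mod 3).
q0, q1 = 2, 3
n = q0 * q1
e0, e1 = 3, 4   # orthogonal idempotents mod 6: 3 ≡ (1, 0), 4 ≡ (0, 1)

def to_2d(a):
    """Permute a length-6 sequence into a 2 x 3 array via the CRT map."""
    A = [[0] * q1 for _ in range(q0)]
    for i in range(n):
        A[i % q0][i % q1] = a[i]
    return A

def conv_1d(a, b):
    """Size-n cyclic convolution."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] += a[i] * b[j]
    return c

def conv_2d(A, B):
    """Cyclic convolution in both dimensions."""
    C = [[0] * q1 for _ in range(q0)]
    for i0 in range(q0):
        for i1 in range(q1):
            for j0 in range(q0):
                for j1 in range(q1):
                    C[(i0 + j0) % q0][(i1 + j1) % q1] += A[i0][i1] * B[j0][j1]
    return C
```

Because the CRT map is a group isomorphism Z_6 → Z_2 × Z_3, the permutation commutes with convolution.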

Vector-Radix FFT
Vector-radix FFT is a multi-dimensional generalization of FFT introduced by [HMCS77]. The core idea is that for a d-dimensional NTT, the computations of each dimension are independent of each other, and we can interleave them freely as long as the order of operations within the same dimension is preserved. Since each dimension can be written as a sequence of additions and multiplications, we can interleave the computations so that strings of multiplications are adjacent to each other. If we look closely into the strings of multiplications from several dimensions, we find that many entries are multiplied by more than one twiddle factor. If we precompute the products of the twiddle factors multiplied to the same entry, then we save multiplications. We illustrate the idea of [HMCS77] with a 2-dimensional NTT of dimensions 2 × 2 acting on (a_{0,0}, a_{0,1}, a_{1,0}, a_{1,1}). Let NTT^(0) and NTT^(1) be the size-2 NTTs of the two dimensions, id be the identity map, and addsub := (a, b) ↦ (a + b, a − b). Applying the 1-dimensional NTTs one after another is to compute (NTT^(1) ⊗ id) ∘ (id ⊗ NTT^(0)). Vector-radix FFT first decomposes NTT^(0) into addsub ∘ (x^(0) ↦ ζ_0) and NTT^(1) into addsub ∘ (x^(1) ↦ ζ_1). If we group the multiplications together, each entry a_{i_0,i_1} is multiplied by the single precomputed twiddle ζ_0^{i_0} ζ_1^{i_1} instead of by ζ_0^{i_0} and ζ_1^{i_1} separately. Note that the core idea of vector-radix FFT is that multiplications from different dimensions can be merged. This holds if we compute both dimensions with GS FFT, and in general, even if we compute CT in one dimension and GS in the other.
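The twiddle-merging idea can be sketched on a 2 × 2 block; the modulus and twiddle values below are arbitrary examples of our choosing:

```python
# Merging twiddles from two dimensions: scaling dimension 0 by z0 and
# dimension 1 by z1 equals one multiplication per entry by z0^i0 * z1^i1.
Q = 7681
z0, z1 = 3383, 2552   # arbitrary example twiddles

def twist_sequential(A):
    """Scale dimension 0 by z0, then dimension 1 by z1 (A[1][1] is hit twice)."""
    A = [row[:] for row in A]
    for i1 in range(2):
        A[1][i1] = A[1][i1] * z0 % Q
    for i0 in range(2):
        A[i0][1] = A[i0][1] * z1 % Q
    return A

def twist_merged(A):
    """At most one multiplication per entry, with z0*z1 precomputed offline."""
    tw = [[1, z1], [z0, z0 * z1 % Q]]
    return [[A[i0][i1] * tw[i0][i1] % Q for i1 in range(2)] for i0 in range(2)]
```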

NTT Multiplications for NTT-unfriendly Rings
A sequence of ring operations can be computed in a larger ring if the results of each step can be uniquely identified in the larger ring. The approach of lifting a coefficient ring to a larger one is known as "NTT multiplications for NTT-unfriendly rings" [CHK + 21, FSS20]. In a broader sense, we use the same terminology in this paper for lifting a polynomial modulus to a larger polynomial modulus.
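The lifting idea can be sketched numerically: a product over the NTT-unfriendly modulus 2048 is computed over a large prime P where an NTT would exist, then read off in Z and reduced. The sizes and moduli below are toy choices of ours, and the convolution is written as a schoolbook loop for clarity (an implementation would use an NTT over Z_P):

```python
# "NTT multiplication for an NTT-unfriendly ring": compute a Z_2048 x ternary
# product in Z_P[x] for a large prime P, then reduce to Z_2048.
n = 16
q = 2048     # NTT-unfriendly power-of-two modulus
P = 65537    # prime with 16 | P - 1; P/2 > n * (q/2) bounds the exact result

def cyclic_mul(a, b, mod):
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % mod
    return c

def center(x, m):
    """Centered representative in (-m/2, m/2]."""
    x %= m
    return x - m if x > m // 2 else x

a = [(i * 37) % q for i in range(n)]        # operand with coefficients in Z_2048
b = [(-1, 0, 1)[i % 3] for i in range(n)]   # ternary operand

# center a so the exact product in Z[x] stays below P/2 in absolute value
c_P = cyclic_mul([center(x, q) % P for x in a], [x % P for x in b], P)
c = [center(x, P) % q for x in c_P]         # unique integer lift, then mod 2048
```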

Number-Theoretic Transforms
This section introduces some theoretical aspects of implementing NTTs. Section 3.1 introduces an optimization applicable to non-radix-2 butterflies. Section 3.2 illustrates a potential code size issue with Good-Thomas FFT. Section 3.3 demonstrates a principle for balancing code size and performance.

Improving Non-Radix-2 Butterflies
We first discuss an interesting observation regarding non-radix-2 butterflies. For simplicity, we compare radix-2 and radix-3 CT butterflies, but our observation generalizes to arbitrary radices. In this section, we assume ψ is an invertible element.
Compared to Algorithm 3, 3 cycles are saved for each radix-3 butterfly.

Generalization to Arbitrary Radices
We now generalize the idea to other radices. For computing (c(ψω_r^i))_{0≤i<r} from c(x) = Σ_{k=0}^{r−1} c_k x^k, we pick a j indicating which multiplications are to be replaced. If ψ = 1, we pick j > 0. Since rc_0 = Σ_{i=0}^{r−1} c(ψω_r^i) by the definition of ω_r, we have c(ψω_r^j) − c_0 = −Σ_{i≠j} (c(ψω_r^i) − c_0). This implies that after computing c(ψω_r^i) − c_0 = Σ_{k=1}^{r−1} c_k ψ^k ω_r^{ik} for all i ≠ j, c(ψω_r^j) can be computed with r − 2 additions and one subtraction. Therefore, our idea replaces r − 1 multiplications with r − 2 additions. Finally, we subtract Σ_{i≠j} (c(ψω_r^i) − c_0) from c_0. One should be aware that for a cyclic size-r NTT, there are many approaches other than the naïve radix-r butterfly if r is odd. Since Good-Thomas FFT applies whenever r has more than one prime factor, we may assume that r = p^k is a prime power. Winograd's FFT exploits the multiplicative structure of the unit group of Z_{p^k} to transform the size-p^k NTT into a size-p^{k−1}(p − 1) convolution [Win78]. For k > 1, since p is odd and p ⊥ p − 1, we apply Good-Thomas FFT to transform the size-p^{k−1}(p − 1) convolution into a multi-dimensional convolution. Therefore, we may restrict to the case k = 1 [Rad68]. If p − 1 has more than one prime factor, we can also apply Good-Thomas FFT. Since p − 1 is even, we may assume p − 1 = 2^h for some h. It is well known that if 2^h + 1 is a prime, then h = 2^t for some t. Therefore, we only need to focus on Fermat primes 2^{2^t} + 1, of which five are known: 3, 5, 17, 257, and 65537. We already discussed the case r = 3. In the next section, we demonstrate the benefit of our ideas for r = 5 using 32-bit arithmetic on the Cortex-M4.
All in all, in the case of r = 5, our improved naïve butterfly outperforms the original naïve butterfly and the Rader approach. For r = 17, Rader's approach is probably better because one can apply more layers of Cooley-Tukey butterflies.
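The identity behind the improved butterfly can be verified numerically; the modulus, radix, twist ψ, and coefficients below are our toy choices:

```python
# Check: since sum_i c(psi * w^i) = r * c_0, output j can be recovered from
# the other r - 1 outputs using only additions and subtractions.
Q = 7681     # prime with 5 | Q - 1
r = 5
psi = 3      # arbitrary invertible twist
w = next(pow(g, (Q - 1) // r, Q) for g in range(2, Q)
         if pow(g, (Q - 1) // r, Q) != 1)   # element of order 5
c = [5, -7, 11, 2, -4]                      # coefficients c_0..c_4

# the r outputs c(psi * w^i), computed directly with multiplications
outputs = [sum(c[k] * pow(psi, k, Q) * pow(w, i * k, Q) for k in range(r)) % Q
           for i in range(r)]

# recover output j from the others: c(psi*w^j) = c_0 - sum_{i != j} (c(psi*w^i) - c_0)
j = 2
t = [(outputs[i] - c[0]) % Q for i in range(r) if i != j]
recovered = (c[0] - sum(t)) % Q
```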

Code Size Consideration of Good-Thomas FFT
In this section, we point out a potential issue with Good-Thomas FFT if we permute the coefficients on-the-fly. We illustrate the issue for transforming a size-2^{k_0} 3^{k_1} 1-dimensional FFT into a 2-dimensional FFT of dimensions 2^{k_0} × 3^{k_1} with 3^{k_1} < 2^{k_0 − 1}. Furthermore, we also assume that the coefficients of the upper half of the input polynomial are all zero, and we compute with dedicated radix-(2, 3) butterflies.

Combining Cooley-Tukey, Good-Thomas, and Vector-Radix FFTs
We describe how to compute size-q_0 q q_1 v convolutions where q_0 is a power of 2, q_1 < q_0/2 is a power of 3, and q ⊥ 3. Furthermore, we require that at most one of q and v is greater than 1. Here v measures the degree of incompleteness of the coprime factorization of the Good-Thomas FFT, and q the incompleteness of the Cooley-Tukey FFT. We will fix q and v at the end.
We now explain how to pick v for controlling the code size. We have to keep in mind that the code size of the initial layer is determined by the period and the number of distinct loops calling dedicated radix-(2, 3) butterflies. From the previous section, we must have code for 3q_1 dedicated radix-(2, 3) butterflies. For the Cortex-M4, q_1 = 3, 9 are reasonable numbers since there are only 9 or 27 dedicated radix-(2, 3) butterflies to be programmed. For the size-1440, size-1536, and size-1728 convolutions, we choose (q, v) = (5, 1), (4, 1), and (1, 3) since 1440 = 32 · 5 · 9 · 1, 1536 = 128 · 4 · 3 · 1, and 1728 = 64 · 1 · 9 · 3. Note that for some platforms, code size might not be a consideration. Very often, such platforms are powerful enough to support vector instructions. One should then choose v as a multiple of the number of elements contained in a vector. Then the entire computation, including dedicated radix-(2, 3) butterflies for on-the-fly permutations, requires no permutation instructions and no additional memory operations. We believe this will be useful for platforms implementing Neon, MVE, AVX2, AVX512, SSE, and SSE2.

Implementations
In this section, we go through our implementations of NTT-based polynomial multiplications, each supporting at least one parameter of NTRU and one parameter of NTRU Prime with little modification. We distinguish two words: level and layer. We use the word "layer" for transformations in terms of mathematics and "level" for computations between a load and a store to the same memory address. All cycle counts refer to Cortex-M4 cycles. This section is structured as follows: Section 4.1 introduces dedicated radix-(2, 3) butterflies. Section 4.2 explains our implementations of the size-1440, size-1536, and size-1728 convolutions. Section 4.3 introduces our choices of convolutions for the NTT-based polynomial multiplications in NTRU and NTRU Prime.

Dedicated Vector-Radix Butterflies
We first introduce how to implement radix-(2, 3) butterflies while permuting the coefficients with Good-Thomas FFT. For simplicity, we illustrate the idea for R[x]/(x^6 − 1). Let e_0 and e_1 be idempotent elements in Z_6 realizing i = (e_0 (i mod 2) + e_1 (i mod 3)) mod 6; concretely, (e_0, e_1) = (3, 4).

Implementing Convolutions
In this section, we describe our chosen transformations in detail. We illustrate the details of our size-1728 convolution because it is the most complicated one and because it includes all the ideas. We also denote each ≅ as an isomorphism corresponding to a level of computation. Table 4 summarizes the transformations for the convolutions, and Table 5 summarizes the implementations of the transformations.
Size-1728 convolution. First of all, we introduce the equivalence x^3 ∼ x^(0) x^(1) to perform an incomplete permutation for Good-Thomas FFT. We now regard R = Z_q[x]/(x^3 − x^(0) x^(1)) as the coefficient ring. Since 1728/3 = 576 = 9 · 64, we perform a 2-dimensional FFT defined over the ring R[x^(0), x^(1)]/((x^(0))^9 − 1, (x^(1))^64 − 1) with vector-radix FFT. Our vector-radix FFT is built upon the tensor product of the size-9 CT FFT on R[x^(0)]/((x^(0))^9 − 1) and the size-64 CT FFT on R[x^(1)]/((x^(1))^64 − 1). We apply one level of dedicated radix-(2, 3) butterflies, one level of radix-(2, 3) butterflies, and one level of 4-layer radix-2 butterflies. For applying radix-(2, 3) butterflies, we merge the multiplications of twiddles from different dimensions into the improved radix-3 butterflies. The main reason for introducing the equivalence x^3 ∼ x^(0) x^(1) is to permute the coefficients without a blow-up in code size, as explained in Section 3.2. If we instead introduce x ∼ x^(0) x^(1), then there is no hope to permute on-the-fly and compute dedicated radix-(2, 3) butterflies at the same time with compact code size. The entire computation proceeds level by level as summarized in Table 4.
Comparison to [CHK + 21]. We compare our implementation to the size-1728 convolution by [CHK + 21]. There are two differences: (i) the number of distinct twiddle factors for the NTTs, and (ii) the approach for computing the result of a size-576 NTT. For (i), they required 9 + 31 + 63 · 8 = 544 distinct twiddle factors in their NTTs. We require 9 + 9 · 1 + 30 = 48 distinct twiddle factors, where the 9 · 1 are the twiddles of the form ω_9^{i_0} ω_64^{i_1} used in the vector-radix FFT. This implies fewer memory operations. For (ii), we compute the result of the 1-dimensional size-576 NTT with a 2-dimensional FFT, whereas [CHK + 21] computed it with a 1-dimensional FFT. There are two benefits: (i) the 2-dimensional FFT requires fewer multiplications than the 1-dimensional FFT regardless of whether half of the inputs are zeros, and (ii) dedicated radix-(2, 3) butterflies save approximately 1.44 cycles for each entry while dedicated 3-layer radix-2 butterflies save only 1 cycle for each entry.

Multi-Parameter Support
As alluded to earlier, each of our NTT-based convolutions supports the polynomial multiplications of more than one parameter of NTRU and NTRU Prime. We compute the result in Z[x] with a chosen NTT-based convolution and call the specific routine final_map for reducing the result to the target polynomial ring. Among our polynomial multiplications, this is the only difference for comparable parameter sets. Furthermore, our NTT-based convolutions for parameters with larger polynomial degrees apply to polynomial multiplications of the smaller parameter sets. Table 6 summarizes the applicability of our NTT-based convolutions. We discuss some possible scenarios demonstrating the benefits in reducing engineering effort.
If more than one comparable parameter is selected by NIST or other institutions (cf. OpenSSH). The first scenario is when more than one comparable parameter set is selected by multiple institutions. The state-of-the-art polynomial multiplications for NTRU [IKPC22] apply only to the polynomial rings selected by NTRU. Adapting their multipliers incurs two significant performance penalties: (i) the arithmetic in NTRU is in Z_{2^k} while we need Z_q for a prime q in NTRU Prime, and (ii) the polynomial moduli x^p − x − 1 in NTRU Prime are incompatible with the structure of Toeplitz matrices without doubling the sizes of the convolutions. (i) implies many modular reductions in Z_q for a prime q, and (ii) implies one has to double the sizes of the target convolutions. Next, the state-of-the-art polynomial multiplications for NTRU Prime [Che21] rely on the special structure of the coefficient rings for choosing the sizes of the convolutions. They are therefore not applicable to NTRU. On the other hand, for comparable parameters in NTRU and NTRU Prime, we only need to replace the final_map reducing to the target polynomial rings while the other parts, including NTT, NTT_small, basemul, and iNTT, remain the same. If more than one parameter of a single scheme is selected for the Cortex-M4. Suppose we want to deploy multiple parameters of a scheme X on the Cortex-M4 where X is NTRU or NTRU Prime. The state-of-the-art polynomial multiplications for NTRU Prime on Cortex-M4 require a different source code for each multiplier. On the other hand, each of our convolutions supports polynomial multiplications up to a certain size. In the extreme case, our size-1728 convolution suffices for all parameters of X with a modified final_map. For the state-of-the-art polynomial multiplications for NTRU on Cortex-M4, the Toeplitz-matrix-based approach supports smaller parameters by padding zeros. However, our more compact implementations are already faster than their unrolled multipliers, as we will see in the next section.
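The final_map reduction described above can be sketched as follows; the toy degree p = 7 and modulus q below are our own illustration, not actual scheme parameters:

```python
# final_map-style reductions: the same product in Z[x] is reduced either
# to Z_q[x]/(x^n - 1) (NTRU) or to Z_q[x]/(x^p - x - 1) (NTRU Prime).

def reduce_cyclic(c, n, q):
    """Reduce a product in Z[x] to Z_q[x]/(x^n - 1)."""
    out = [0] * n
    for i, v in enumerate(c):
        out[i % n] = (out[i % n] + v) % q
    return out

def reduce_ntru_prime(c, p, q):
    """Reduce a product in Z[x] to Z_q[x]/(x^p - x - 1)."""
    c = list(c)
    for i in range(len(c) - 1, p - 1, -1):   # x^i = x^(i-p+1) + x^(i-p)
        c[i - p + 1] += c[i]
        c[i - p] += c[i]
        c[i] = 0
    return [v % q for v in c[:p]]
```

For example, x^p reduces to x + 1 modulo x^p − x − 1 but to 1 modulo x^p − 1; the convolution feeding either routine is identical.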

Results
This section reports the performance numbers of our implementations. We first describe our benchmark environment in Section 5.1. We then go through the implementations of the convolutions in Section 5.2 and illustrate their impact on NTRU and NTRU Prime in Section 5.3. All of our implementations target "big by small" polynomial multiplications in NTRU and NTRU Prime.

Benchmark Environment
We target the STM32F407-DISCOVERY board featuring an STM32F407VG Cortex-M4 microcontroller with 192 kB of SRAM and 1 MB of flash. Our benchmarking setup is based on pqm4 [KRSS]. We clock at 24 MHz for benchmarking entire schemes. For individual functions, we clock at 24 MHz for consistency with the setups in the literature. Furthermore, we include cycle counts at 168 MHz to demonstrate the impact of code size. Although our code size optimization is specific to our board, our programs are designed with flexible loop-unrolling, adjusted by simply changing the numbers.

Performance of Polynomial Multiplications
We compare our implementations to existing works; our polynomial multiplications outperform the previous NTT-based results by 10.6%. Notice that when benchmarking at full speed (168 MHz), we only pay 1.3%−3.5% additional cycles. After carefully examining the implementations by [IKPC22], we found that their implementations are fully unrolled. Although we believe that their implementations can be made much more compact with some care, our implementations at 168 MHz are already faster than their implementations at 24 MHz. This implies that for practical deployment, users have a much wider range of frequencies to fit the implementations into their use cases without sacrificing performance. Additionally, no algebraic properties are exploited in our implementations, while [IKPC22] only applies to (weighted) convolutions. Therefore, with little modification, our implementations support polynomial multiplications in NTRU Prime as shown in the next section.

Polynomial Multiplications in NTRU Prime
We compare our polynomial multiplications to [ACC + 21] and [Che21, Method 2]. Notice that the implementations by [Che21] make use of the special structures of the coefficient rings. For ntrulpr857/sntrup857, our size-1728 convolution outperforms the size-1722 convolution from [Che21] by 10.3%. For ntrulpr761/sntrup761, our size-1536 convolution outperforms all the implementations by [ACC + 21], but it is slower than the size-1530 convolution from [Che21].

NTRU Performance
We first compare the encapsulations. For ntruhps2048677 and ntruhps4096821, we outperform [IKPC22] by 35.7%−36.3%. The majority of the improvement comes from the more optimized crypto_sort. A fair comparison is ntruhrss701, where the only difference is one big-by-small polynomial multiplication. We outperform [IKPC22] by 2.2% for ntruhrss701.
For the decapsulations, the differences between our implementations and the TMVP approach by [IKPC22] are one big-by-small polynomial multiplication and one polynomial multiplication in Z_3. We outperform [IKPC22] by 0.7%−1.5%.
Our NTT-based multiplications have a limited impact on key generation. Key generations are dominated by computing inverses in Z_q[x]/(x^n − 1). The inverses are first computed in Z_2[x]/(x^n − 1) and then lifted to Z_q[x]/(x^n − 1). [Li21] implemented the inverses in Z_2[x]/(x^n − 1) with the fast constant-time GCD by [BY19]. [IKPC22] applied their improved polynomial multiplications to the lifting from Z_2[x]/(x^n − 1) to Z_q[x]/(x^n − 1). We simply integrate their work and plug in the improved crypto_sort. The majority of the improvement comes from [Li21]. For the rest, the improvement mainly comes from the improved crypto_sort and from [IKPC22] for the lifting to Z_q[x]/(x^n − 1).
^1 Integrated into pqm4 in commit 2691b4915b76db8b765ba89e4e09adc6b999763f.

NTRU Prime Performance
We apply our NTT-based multiplications to all the big-by-small polynomial multiplications in NTRU Prime. We replace the AES with secret-dependent input by the fixslicing AES of [AP21]. This increases the overall cycle counts of NTRU LPRime. Furthermore, we improve the crypto_sort. Table 10 summarizes the overall performance numbers. We first compare NTRU LPRime. [ACC + 21] reported performance numbers with an AES implementation with secret-dependent table lookups. At the same time, [AP21] proposed fixslicing AES. The timings increase drastically by changing to fixslicing AES for the secret-dependent operations. Therefore, our ntrulpr761 is slower than the fastest approach by [ACC + 21] even though our polynomial multiplication is comparable to [ACC + 21]'s. A fair comparison is against [Che21]. For ntrulpr653 and ntrulpr761, since our polynomial multiplications are slower than [Che21]'s, the overall performance is expected to be slower. However, during the encapsulation, [Che21] computed two multiplications by the same polynomial as two separate polynomial multiplications. We instead cache the NTT of the shared operand and reuse it later. This explains why our encapsulations are faster while our key generations and decapsulations are slower than [Che21]'s.
Next, we compare Streamlined NTRU Prime. In Streamlined NTRU Prime, we need crypto_sort in the key generations and encapsulations. Since we optimize the crypto_sort, our key generations are faster than [Che21]'s even though our polynomial multiplications are slower. Our decapsulations are slower than [Che21]'s since the only difference is two generic-by-ternary polynomial multiplications.
Finally, we present the performance numbers of ntrulpr857 and sntrup857.
The size-m NTT sends a(x) to the m-tuple (a(ζ), a(ζω_m), …, a(ζω_m^{m−1})). If m = n, we call it a complete NTT, and if m ≠ n, we call it an incomplete NTT. If ζ^m = 1, we call it a cyclic NTT and write it as NTT_{R[x]:n:ω_m}. When the context is clear, we simply say NTT. Furthermore, let us denote by ω_n a principal n-th root of unity. If ω_n exists, there are φ(n) choices of ω_n sharing the same algebraic properties, where φ is Euler's totient function. For an m | n, we usually fix an ω_n and define ω_m := ω_n^{ln/m} where l is coprime to m.

Figure 3: Radix-(2, 2) butterfly and 2-layer radix-2 butterfly comparison. A group of four circles represents four coefficients. Each circled region applies the tensor product of the isomorphisms with the same color to a group of four. The left four are scaled versions of the inputs for the radix-(2, 2) butterfly, and the right four are the intermediate results of the 1-dimensional radix-2 butterflies. Solid lines send the scaled results and dotted lines send the negated and scaled ones. Several lines are omitted for clarity. For the details, please refer to the definitions of the symbols.

Table 1: NTRU parameter sets. Starred parameters are covered in this paper.

Table 2: NTRU Prime parameter sets. Starred parameters are covered in this paper.

Table 4: Summary of transformations.

Table 5: Summary of implementations.

Table 6: Summary of the applicability of NTT-based convolutions. Starred checkmarks are implemented in this paper.

Table 7: Benchmarks of polynomial multiplications. Numbers are rounded to the nearest thousand. We benchmark our implementations at both 24 MHz and 168 MHz: the first number is benchmarked at 24 MHz and the second at 168 MHz. Implementations in the literature are all reported at 24 MHz by their authors.

Table 9: NTRU cycle counts of the fastest approaches in this work.