Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4

The U.S. National Institute of Standards and Technology (NIST) has designated ARM microcontrollers as an important benchmarking platform for its Post-Quantum Cryptography standardization process (NISTPQC). In view of this, we explore the design space of the NISTPQC finalist Saber on the Cortex-M4 and its close relation, the Cortex-M3. In the process, we investigate various optimization strategies and memory-time trade-offs for number-theoretic transforms (NTTs). Recent work by [Chung et al., TCHES 2021 (2)] has shown that NTT multiplication is superior to Toom-Cook multiplication for unprotected Saber implementations on the Cortex-M4 in terms of speed. However, it remained unclear whether NTT multiplication can outperform Toom-Cook in masked implementations of Saber. Additionally, it was an open question whether Saber with NTTs can outperform Toom-Cook in terms of stack usage. We answer both questions in the affirmative. Additionally, we present a Cortex-M3 implementation of Saber using NTTs outperforming an existing Toom-Cook implementation. Our stack-optimized unprotected M4 implementation uses around the same amount of stack as the most stack-optimized Toom-Cook implementation while being 33%-41% faster. Our speed-optimized masked M4 implementation is 16% faster than the fastest masked implementation using Toom-Cook. For the Cortex-M3, we outperform existing implementations by 29%-35% in speed. We conclude that, for both stack- and speed-optimization purposes, polynomial multiplication in Saber should be based on the NTT rather than Toom-Cook on the Cortex-M4 and Cortex-M3. In particular, in many cases, multi-moduli NTTs perform best.


Introduction
Shor's algorithm [Sho97] threatens all widely deployed public-key cryptography as it solves the integer factorization and discrete logarithm problems on a quantum computer. Therefore, NIST has called for proposals to replace their existing standards for digital signatures and key encapsulation mechanisms (KEMs) [NIS]. We are currently in the third round of the process, in which 7 finalist schemes and 8 alternate schemes remain [AASA+20]. Of the 7 finalists, 4 are KEMs: Classic McEliece [ABC+20], a code-based scheme, plus Kyber [ABD+20b], NTRU [CDH+20], and Saber [DKRV20], which are all lattice-based with similar performance characteristics.
Saber is based on the module learning with rounding (M-LWR) problem. Its arithmetic operates in the polynomial ring R_q = Z_q[x]/(x^256 + 1) with q = 2^13 and n = 256. One of Saber's distinguishing features, compared to Kyber [ABD+20b], is the power-of-two modulus q = 2^13 (while Kyber uses the prime modulus 3329). Despite the architectural friendliness of a power-of-two modulus, its major disadvantage is that it rules out the direct applicability of number-theoretic transforms (NTTs). Recent work by Chung et al. [CHK+21] has shown that Saber can still profit from NTT multiplications by switching to a larger prime modulus allowing NTTs. Indeed, Saber with NTTs can be significantly faster than Toom-Cook on the major NIST software targets: the ARM Cortex-M4 and Haswell with AVX2.
We address three questions in this paper: 1. The Chung et al. [CHK+21] implementation has a large memory footprint. Memory usage can prohibit implementations from being used on microcontrollers with much less memory available than the development boards commonly used in the literature.
Since the implementation solely uses stack memory (which is desirable for embedded implementations), reducing the memory consumption corresponds to optimizing the stack usage. Therefore, we explore how well NTT-based Saber performs stack-wise on the Cortex-M4. In particular, can we achieve a smaller memory footprint for Saber with NTTs compared to the stack-optimized Toom-Cook implementation of [MKV20]?
2. The [CHK+21] implementation relies on one of the multiplicands being small, so that the correct result is already obtained modulo the 25-bit modulus. This is true for the secrets in Saber, but it does not apply to masked implementations in which the secret is arithmetically shared modulo q (e.g., [VBDK+20]). How does Saber with NTTs perform for masked implementations? In particular, can it outperform the masked Toom-based Saber of [VBDK+20] in speed and stack usage?
3. While the Cortex-M4 is the primary microcontroller optimization target of NIST, its cheaper predecessor, the Cortex-M3 remains widely deployed, e.g., in hardware security modules (HSMs) like the STA1385 1 .However, the Cortex-M3 is slightly less powerful than the Cortex-M4 especially in terms of features critical to polynomial multiplication.In particular, long multiplications smull and smlal are not executed in constant time and, consequently, cannot be safely used when handling secret data.The Cortex-M4 implementation heavily relies on these instructions.So the open question is: Should Saber implementations targeting the Cortex-M3 use NTTs?
For Question 1, we propose an NTT-based implementation with a composite modulus q′ = q_0 q_1, with q_0 and q_1 coprime and both NTT-friendly. We can thus define an NTT modulo q′, enabling a very stack-efficient implementation that is competitive in memory usage and at least 30% faster than the most stack-optimized Toom-Cook implementation.
We answer Question 2 in the affirmative by computing a 32-bit NTT and a 16-bit NTT. As long as the product of the moduli bounds the coefficients of the masked product, the NTT-based multiplication can be viewed as a generic multiplier for Saber.
Finally, we answer Question 3 also in the affirmative. Here we have two natural alternatives for NTT-based polynomial multiplication using only 16-bit multiplications. One can use 32-bit NTTs but emulate the long multiplications (as already done to implement Dilithium, which requires 32-bit NTTs [GKS21]). Or one can adopt the approach of the AVX2 implementation of [CHK+21] and use two 16-bit NTTs, which can be implemented efficiently while avoiding long multiplications; the result is then recombined using the Chinese remainder theorem for integer rings. We show that both approaches are faster than Toom-Cook and that the latter approach is the fastest. Furthermore, we show that the stack optimization from the Cortex-M4 can also be applied to the 16-bit NTT approach on the Cortex-M3. In summary, first, the most stack-efficient unmasked implementations on the Cortex-M4 are using NTTs. Secondly, we exhibit two NTT-based Saber implementations on the Cortex-M3, both outperforming Toom-Cook. Lastly, masked Saber implementations are also best implemented using NTTs, regardless of whether we value speed, memory, or both.
In the process, we point out an overlooked stack optimization with multi-moduli NTTs. The optimization justifies an unconventional use of a composite modulus for unmasked Saber and of unequal-size NTTs for masked Saber, neither of which had been implemented before. Furthermore, we correct a misunderstanding regarding negacyclic convolutions by providing the actual if-and-only-if condition. Lastly, we justify the use of Cooley-Tukey butterflies for the inverse of negacyclic NTTs.

Structure of the paper. This paper is structured as follows: Section 2 introduces Saber, the ARM Cortex-M4 and Cortex-M3, and Montgomery multiplication. In Section 3, we present the mathematics for the NTTs implemented in this paper. In Section 4, we go through the implementation details of MatrixVectorMul with different emphases. In Section 5, we present the performance of our implementations and give some t-test results.

Preliminaries
This section is organized as follows: first, we recall the key encapsulation mechanism Saber in Section 2.1. Section 2.2 introduces the architectures targeted in this paper: the Cortex-M4 and Cortex-M3. Section 2.3 describes the Montgomery multiplication that is used throughout our implementations.

Saber
Saber [DKRV20] is a lattice-based key encapsulation mechanism and NISTPQC finalist. It is based on the Module Learning With Rounding (M-LWR) problem on the ring R_q = Z_q[x]/(x^256 + 1). For all parameter sets, q = 2^13 and n = 256.
Algorithms 1-3 are the CPA-secure scheme's key generation, encryption, and decryption and follow the submission material [DKRV20]. Here, Sample_U samples from the uniform distribution, Sample_B samples from a binomial distribution, and Expand expands a seed to a uniform matrix of polynomials.
Saber's most time-consuming operation in key generation and encryption is the matrix-vector multiplication of polynomials, A^T s and As′, respectively. In decryption, the most expensive operation is the inner product b^T s. We do not further discuss Saber's CCA-secure KEM construction, which uses a variant of the Fujisaki-Okamoto (FO) transform due to Hofheinz-Hövelmanns-Kiltz [HHK17]. We note that Saber does require re-encryption in the decapsulation, and therefore, improving the encryption also improves the decapsulation.

Parameters. The module dimension l, the rounding parameter T, and the secret distribution parameter μ vary according to the parameter sets LightSaber, Saber, and FireSaber (respectively targeting NIST security levels 1, 3, and 5); see Table 1 for a summary. Hence, MatrixVectorMul computes the product of an l × l matrix and an l × 1 vector, whereas InnerProd computes the inner product of two l × 1 vectors.

ARM Cortex-M4 and Cortex-M3
The ARM Cortex-M4 is selected by NIST as a standard embedded platform to evaluate candidates (including Saber) in the NISTPQC process. For both scientific curiosity and practical reasons, we also implement Saber on the cheaper and also common Cortex-M3 to explore the variation in performance when some instructions are not supported or can only be used for secret-independent computations. The Cortex-M4 implements the ARMv7E-M architecture. Some of its most prominent features are as follows:
• 14 general-purpose registers. There are 16 registers, named r0-r15. Except for the stack pointer (r13) and the program counter (r15), all registers are general purpose.
• Floating-point registers. There are 32 single-precision floating-point registers that can also be used as a low-latency cache (cf. [ACC+21, CHK+21]).
• Cycles for load and store instructions. Store instructions are always one cycle. A sequence of h loads with no dependency is always h + 1 cycles.
• Single-cycle long multiplications. Long multiplications {u,s}mull and their accumulating counterparts {u,s}mlal are always one cycle.
• Barrel shifter. Shifts and rotates (asr, lsl, lsr, and ror) come at no extra cost when used as the "flexible second operand" of a standard data-processing instruction.

The ARM Cortex-M3 implements the ARMv7-M architecture. The most important differences between the Cortex-M3 and the Cortex-M4 regarding constant-time implementations of Saber with NTTs are as follows [ARM10]:
• No floating-point registers. There is no FPU; hence, we experience more overhead when spilling registers.
• Early-terminating long multiplications. The long multiplications (and the variants with accumulation) {u,s}mull and {u,s}mlal are early-terminating instructions that cannot be used for computing on secret data.
• No SIMD instructions. There are no instructions treating registers as packed 8-bit or 16-bit elements or operating on specific halves of operands.

Montgomery multiplication
We employ Montgomery multiplication for computing mMul(a, bR) ≡ ab (mod Q), where b is a known constant, R is a constant that is architecture-friendly and coprime to Q, and mod± is the signed modular reduction giving centered representatives. Concretely,
mMul(a, bR) = hi(a · bR + Q · (lo(a · bR) · Q′ mod± R)),
where Q′ = −Q^{−1} mod± R, and lo and hi extract the lower log_2 R bits and upper log_2 R bits, respectively. In our implementations, we use either R = 2^16 or R = 2^32.
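The formula above can be sketched as follows. This is a hedged illustration (not the paper's exact code) with R = 2^16 and Q = 769, one of the 16-bit NTT primes appearing later in the paper; for Q = 769, Q′ = 767 since 769 · 767 ≡ −1 (mod 2^16).

```c
#include <stdint.h>

/* Signed Montgomery multiplication sketch with R = 2^16:
 *   mMul(a, bR) = hi(a*bR + Q*(lo(a*bR)*Q' mod± R)),  Q' = -Q^{-1} mod± R.
 * The low 16 bits of the accumulated value cancel, so the high half is
 * congruent to a*b*R*R^{-1} = a*b modulo Q. */
#define Q 769
#define QPRIME 767

int16_t mmul(int16_t a, int16_t bR) {
    int32_t t = (int32_t)a * bR;                          /* full product  */
    int16_t m = (int16_t)((int32_t)(int16_t)t * QPRIME);  /* lo(t)*Q' mod± R */
    t += (int32_t)m * Q;                                  /* lo becomes 0  */
    return (int16_t)(t >> 16);                            /* hi(...)       */
}
```

With b = 3 precomputed as bR = 3 · 2^16 mod 769 = 513, mmul(5, 513) yields 5 · 3 = 15.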
Number-theoretic transforms

In this section, we go over the mathematics of NTTs in their abstract form while keeping the notation consistent with the implementation details in Section 4. All formulations are known in the literature in various abstractions.
An invertible size-n NTT taking a degree-(n − 1) polynomial from Z_m[x]/(x^n − ζ^n) is defined if and only if the following conditions are satisfied:
1. Divisibility: Suppose m admits the prime factorization m = p_0^{d_0} · · · p_{k−1}^{d_{k−1}}. Then n must divide O(m) := gcd(p_0 − 1, . . . , p_{k−1} − 1) [AB74].
2. Invertibility: ζ must be invertible in Z_m [CF94].
Condition 1 enables NTTs over Z_m[x]/(x^n − 1), and Condition 2 extends the definition to Z_m[x]/(x^n − ζ^n). Since O(8192) = 1, Saber's coefficient ring is unfriendly for NTTs. We also note that the condition n | O(m) can be generalized to finite commutative rings by [DV78, Theorem 4]. In Section 3.2.1, we adopt a more general definition over commutative rings (without requiring finiteness) to point out the connection to the Chinese remainder theorem for rings.
The Chinese remainder theorem (CRT) for (commutative) rings. Let R be a commutative ring, let I_i be ideals of R such that I_i + I_j = R for i ≠ j, and let δ be the Kronecker delta. Section 3 is all about the CRT in the abstract sense that the formulae are various instantiations of the isomorphism
φ : R / ∩_i I_i ≅ ∏_i R/I_i,  r ↦ (r mod I_0, . . . , r mod I_{n−1})
[Für09, Theorem 2.4]. The inverse can be written as
φ^{−1}(ŝ_0, . . . , ŝ_{n−1}) = ∑_{i=0}^{n−1} r_i ŝ_i,
where the unique (r_0, r_1, . . . , r_{n−1}) satisfies r_i r_j = δ_ij r_i and ∑_{i=0}^{n−1} r_i = 1 [Bou89, Proposition 10-(b), Section 8.11, Chapter I]. Note that the existence of (r_0, r_1, . . . , r_{n−1}) is equivalent to the existence of (I_0, I_1, . . . , I_{n−1}). We will then review how the divisibility and invertibility conditions translate into φ and φ^{−1} by relating them to the r_i.

This section is organized as follows: Section 3.1 introduces how to combine integer coefficient rings by explicit CRT computations. Section 3.2 introduces the NTT over integer rings and characterizes the NTT as the CRT for rings. Section 3.3 defines polynomial multiplication modulo x^n − ψ. Section 3.4 introduces the discrete weighted transform for computing polynomial multiplication modulo x^n − ζ^n and the "twisting" from (mod x^n − ζ^n) to (mod x^n − 1). Section 3.5 discusses Cooley-Tukey and Gentleman-Sande fast Fourier transforms. Finally, Section 3.6 explains how to compute NTTs for NTT-unfriendly rings, and Section 3.7 introduces incomplete NTTs.

Explicit Chinese remainder theorem computations
Explicitly computing a number from its remainders modulo a small number of coprime moduli q_i is an "explicit Chinese remainder theorem" computation. There are essentially two known algorithms: [MS90, Theorem 23], which resembles Lagrange interpolation, and [CHK+21, Theorem 1], which more closely resembles divided-difference interpolation.
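For the two-moduli case used in this paper, the divided-difference style of [CHK+21, Theorem 1] can be sketched as below; the helper names and the toy moduli q_0 = 7681 and q_1 = 769 (with q_0^{−1} ≡ 598 mod q_1) are our own choices for illustration.

```c
#include <stdint.h>

/* Explicit CRT for two coprime moduli: given a mod q0 and a mod q1,
 * the centered representative modulo q0*q1 is
 *     a = a0 + q0 * ((a1 - a0) * q0^{-1} mod± q1). */
static int64_t crt2_center(int64_t x, int64_t m) {  /* mod± m */
    int64_t r = x % m;
    if (r < 0) r += m;
    if (r > m / 2) r -= m;
    return r;
}

int64_t crt2(int64_t a0, int64_t a1, int64_t q0, int64_t q1, int64_t q0inv) {
    return a0 + q0 * crt2_center((a1 - a0) * q0inv, q1);
}
```

For example, 123456 is recovered exactly from its residues modulo 7681 and 769 because it lies well inside the centered range of 7681 · 769 = 5906689.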

Explicit formulations for NTTs
In [AB74], the divisibility condition n | O(m) was established for NTTs over arbitrary Z_m. To arrive at a more constructive definition: if n | O(m), then n is invertible in Z_m, and we can always choose a principal n-th root of unity ω, giving
NTT_{n:1:ω}(a)_j = ∑_{i=0}^{n−1} a_i ω^{ij}
[Für09], along with its inverse
NTT^{−1}_{n:1:ω}(â)_i = n^{−1} ∑_{j=0}^{n−1} â_j ω^{−ij}.
A principal n-th root of unity ω is an n-th root of unity satisfying the orthogonality condition ∑_{i=0}^{n−1} ω^{ij} = 0 for all j ≢ 0 (mod n).
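The defining sums can be instantiated directly (quadratic time, no FFT) as a hedged toy example; the parameters are our own choice: m = 257, n = 8, ω = 4 is a principal 8-th root of unity (4^4 ≡ −1 mod 257), ω^{−1} = 193, and n^{−1} = 225 since 8 · 225 = 1800 ≡ 1 (mod 257).

```c
#include <stdint.h>

#define NTT8_M 257
#define NTT8_N 8

static int64_t ntt8_pow(int64_t b, int64_t e) {
    int64_t r = 1;
    while (e-- > 0) r = r * b % NTT8_M;
    return r;
}

/* out[j] = sum_i in[i] * root^(i*j) mod m  (forward or inverse sum). */
static void ntt8_transform(const int64_t *in, int64_t *out, int64_t root) {
    for (int j = 0; j < NTT8_N; j++) {
        int64_t s = 0;
        for (int i = 0; i < NTT8_N; i++)
            s = (s + in[i] * ntt8_pow(root, (int64_t)i * j)) % NTT8_M;
        out[j] = s;
    }
}

/* Returns 1 iff NTT followed by iNTT (scaled by n^{-1}) is the identity. */
int ntt8_roundtrip_ok(void) {
    int64_t a[NTT8_N] = {3, 1, 4, 1, 5, 9, 2, 6};
    int64_t ahat[NTT8_N], back[NTT8_N];
    ntt8_transform(a, ahat, 4);        /* forward with ω = 4      */
    ntt8_transform(ahat, back, 193);   /* inverse with ω^{-1}     */
    for (int i = 0; i < NTT8_N; i++)
        if (back[i] * 225 % NTT8_M != a[i]) return 0;  /* n^{-1} = 225 */
    return 1;
}
```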

Multi-moduli NTTs to save memory
There is an often-overlooked implementation aspect of multi-moduli NTTs on the ARM Cortex-M4: let q_0 and q_1 be coprime moduli for 16-bit NTTs; then we can compute an NTT over Z_{q_0 q_1}. Due to the M4's powerful one-cycle long multiplications, a 32-bit NTT over Z_{q_0 q_1} easily outpaces 2 × 16-bit NTTs. Indeed, in cycle counts,
16-bit NTT < 32-bit NTT < 2 × 16-bit NTTs.
We can thus reduce stack usage without a huge sacrifice in performance.
For multiplying two size-n polynomials whose result coefficients are smaller than a product of k 16-bit primes, we only need (k + 1) · n · 16 bits = 2n(k + 1) bytes of storage, as follows. We first note that this memory usage can be achieved with k distinct 16-bit NTTs by interleaving the computation. However, since one 32-bit NTT is significantly faster than two 16-bit NTTs, we should replace every pair of 16-bit NTTs with a 32-bit NTT. If k is odd, then we can process the multiplicands by computing (k − 1)/2 32-bit NTTs and one 16-bit NTT for each. If k is even, for the first multiplicand we compute k/2 32-bit NTTs and transform the last one into the result of two 16-bit NTTs, while for the second multiplicand we compute k/2 − 1 32-bit NTTs and two 16-bit NTTs.

Prior uses of multi-moduli
The residue number system (RNS) is used in the context of homomorphic encryption for computing NTTs over primes p_0, p_1, . . . , p_{k−1} for speed. To use the explicit CRT à la [MS90, Theorem 23], the representation is usually redundant. Here we use only two 16-bit prime moduli (non-redundantly) for reducing stack usage, jumping between the rings as shown in Section 4. In [HP21], the authors essentially used RNS to protect linear computations from side-channel attacks. They lift Z_{p_0} to Z_{p_0 p_1} and compute NTTs over Z_{p_0 p_1} for fault protection. Our approach is to switch to Z_{p_0 p_1} for speed and to Z_{p_0} and Z_{p_1} for saving memory. We detail when to switch which way later.

Polynomial multiplication
Let ψ ∈ Z_m. Polynomial multiplication modulo x^n − ψ means computing a(x)b(x) with the agreement that x^n = ψ. In Saber, we are computing negacyclic convolutions (ψ = −1) with n = 256.
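As a hedged reference sketch, the schoolbook form of this multiplication wraps every product that overflows degree n back down, multiplied by ψ. The toy parameters (n = 4, m = 17, ψ = 2) are our own; Saber uses ψ = −1 and n = 256.

```c
#include <stdint.h>

/* Schoolbook multiplication in Z_m[x]/(x^n - psi). */
static void psi_polymul(const int64_t *a, const int64_t *b, int64_t *c,
                        int n, int64_t m, int64_t psi) {
    for (int k = 0; k < n; k++) c[k] = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int k = i + j;
            int64_t t = a[i] * b[j] % m;
            if (k >= n) { k -= n; t = t * psi % m; }  /* wrap: x^n = psi */
            c[k] = (c[k] + t) % m;
        }
}

/* x * x^3 = x^4 ≡ psi = 2 in Z_17[x]/(x^4 - 2), so c[0] = 2. */
int64_t psi_demo_constant_coeff(void) {
    int64_t a[4] = {0, 1, 0, 0}, b[4] = {0, 0, 0, 1}, c[4];
    psi_polymul(a, b, c, 4, 17, 2);
    return c[0];
}
```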

Discrete weighted transform
We review how to apply the discrete weighted transform (DWT) to negacyclic convolutions and, in general, to polynomial multiplication modulo x^n − ζ^n for an invertible ζ. In [CF94], the DWT is described as "introducing a weight signal to compute weighted convolution". In our context, the weight signal is the sequence of powers 1, ζ, . . . , ζ^{n−1} of a scalar ζ [CF94, Equation (2.13)]. We therefore subscript the NTT with both ζ and ω for this DWT.
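The weighting trick can be checked on a toy instance: the negacyclic product (mod x^n + 1) equals the cyclic product (mod x^n − 1) of the inputs weighted by 1, ζ, . . . , ζ^{n−1}, unweighted by ζ^{−i} afterwards. The parameters are our own choice: m = 257, n = 8, ζ = 2 (so ζ^8 ≡ −1), and ζ^{−1} = 129 since 2 · 129 = 258 ≡ 1 (mod 257).

```c
#include <stdint.h>

#define DWT_M 257
#define DWT_N 8

static void dwt_cyclic(const int64_t *a, const int64_t *b, int64_t *c) {
    for (int k = 0; k < DWT_N; k++) c[k] = 0;
    for (int i = 0; i < DWT_N; i++)
        for (int j = 0; j < DWT_N; j++)
            c[(i + j) % DWT_N] = (c[(i + j) % DWT_N] + a[i] * b[j]) % DWT_M;
}

static void dwt_negacyclic(const int64_t *a, const int64_t *b, int64_t *c) {
    for (int k = 0; k < DWT_N; k++) c[k] = 0;
    for (int i = 0; i < DWT_N; i++)
        for (int j = 0; j < DWT_N; j++) {
            int k = i + j;
            int64_t t = a[i] * b[j] % DWT_M;
            if (k >= DWT_N) { k -= DWT_N; t = DWT_M - t; }  /* x^8 = -1 */
            c[k] = (c[k] + t) % DWT_M;
        }
}

int dwt_check(void) {
    int64_t a[DWT_N] = {3, 1, 4, 1, 5, 9, 2, 6};
    int64_t b[DWT_N] = {2, 7, 1, 8, 2, 8, 1, 8};
    int64_t wa[DWT_N], wb[DWT_N], cyc[DWT_N], ref[DWT_N];
    int64_t w = 1, wi = 1;
    for (int i = 0; i < DWT_N; i++) {       /* weight by ζ^i       */
        wa[i] = a[i] * w % DWT_M;
        wb[i] = b[i] * w % DWT_M;
        w = w * 2 % DWT_M;
    }
    dwt_cyclic(wa, wb, cyc);
    dwt_negacyclic(a, b, ref);
    for (int i = 0; i < DWT_N; i++) {       /* unweight by ζ^{-i}  */
        if (cyc[i] * wi % DWT_M != ref[i]) return 0;
        wi = wi * 129 % DWT_M;
    }
    return 1;
}
```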
Obviously, GS(CT(a_0, a_1, c), c^{−1}) = 2(a_0, a_1) = CT(GS(a_0, a_1, c), c^{−1}). This observation suggests that any computation composed of CT and GS butterflies can be inverted by inverting the CT and GS butterflies and then canceling the scaling by a power of 2.
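The identity is easy to verify in code. This is a hedged sketch over Z_q with q = 769 (a 16-bit NTT prime also used later in the paper): CT(a_0, a_1, c) = (a_0 + c·a_1, a_0 − c·a_1) and GS(a_0, a_1, c) = (a_0 + a_1, (a_0 − a_1)·c); composing either with the other at twiddle c^{−1} yields 2·(a_0, a_1).

```c
#include <stdint.h>

#define BF_Q 769

static int64_t bf_mod(int64_t x) { x %= BF_Q; return x < 0 ? x + BF_Q : x; }

static void bf_ct(int64_t *a0, int64_t *a1, int64_t c) {
    int64_t t = bf_mod(*a1 * c), x = *a0;
    *a0 = bf_mod(x + t);
    *a1 = bf_mod(x - t);
}

static void bf_gs(int64_t *a0, int64_t *a1, int64_t c) {
    int64_t x = *a0, y = *a1;
    *a0 = bf_mod(x + y);
    *a1 = bf_mod((x - y) * c);
}

/* Checks both compositions for c = 2, c^{-1} = 385 (2*385 ≡ 1 mod 769). */
int bf_inversion_ok(void) {
    int64_t a0 = 123, a1 = 456, x0 = a0, x1 = a1, y0 = a0, y1 = a1;
    bf_ct(&x0, &x1, 2); bf_gs(&x0, &x1, 385);
    bf_gs(&y0, &y1, 2); bf_ct(&y0, &y1, 385);
    return x0 == bf_mod(2 * a0) && x1 == bf_mod(2 * a1)
        && y0 == bf_mod(2 * a0) && y1 == bf_mod(2 * a1);
}
```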
[Figure 1: CT and GS butterfly diagrams for NTTs over x^8 − 1 and over x^4 + 1.]

There are at least two ways of implementing NTT_{n:ζ:ω} and its inverse as described in the previous section. In this section, we fix n = 2^k, 2^k | O(m), and ω a principal 2^k-th root of unity. We describe the case where ζ only needs to be invertible.

CT for NTT and GS for iNTT
Computing NTT_{2^k:ζ:ω} with CT butterflies means mapping Z_m[x]/(x^{2h} − ζ_h^2) to Z_m[x]/(x^h − ζ_h) × Z_m[x]/(x^h + ζ_h) at each layer, which, when applied recursively, results in the bit-reversal of the output indices.

GS for NTT and CT for iNTT
Now we can invert with CT butterflies to derive the CT algorithm for NTT^{−1}_{n:ζ:ω}. If ζ^{−1} = ±1, then we can absorb the 2^k − 1 multiplications by 2^{−k} as shown in Figure 1. We implement the CT algorithm for NTT^{−1}_{n:ζ:ω} on the Cortex-M4.

NTT for NTT-unfriendly rings
For multiplying polynomials over finite integer rings not amenable to NTTs, since the coefficients of the result are bounded, we can choose a large NTT-friendly modulus, compute the result as in Z, and then reduce to the target coefficient ring [FSS20, CHK+21].
For Saber, since we are multiplying a matrix by a vector with the polynomial modulus x^256 + 1, the resulting (signed) coefficients lie within [−12582912, 12582912]: they are bounded by l · n · (q/2) · (μ/2), which is at most 4 · 256 · 4096 · 3 = 12582912 across the parameter sets. Therefore, if we choose a modulus q′ > 25165824 = 2 · 12582912 satisfying 2n | O(q′), we can compute the multiplication with length-n negacyclic NTTs in Z_{q′}.
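The bound arithmetic can be double-checked mechanically; the search helper below, which finds the smallest prime q′ with 2n = 512 dividing q′ − 1 above the bound, is our own illustration and not necessarily the modulus chosen in [CHK+21].

```c
#include <stdint.h>

static int bound_is_prime(int64_t p) {
    if (p < 2) return 0;
    for (int64_t d = 2; d * d <= p; d++)
        if (p % d == 0) return 0;
    return 1;
}

/* Max of l*n*(q/2)*(mu/2) over LightSaber/Saber/FireSaber. */
int64_t bound_max_coeff(void) {
    int l[3] = {2, 3, 4}, mu2[3] = {5, 4, 3};  /* mu/2 per parameter set */
    int64_t best = 0;
    for (int i = 0; i < 3; i++) {
        int64_t b = (int64_t)l[i] * 256 * 4096 * mu2[i];
        if (b > best) best = b;
    }
    return best;
}

/* Smallest prime q' > 2*bound with 512 | q'-1 (so 2n | O(q')). */
int64_t bound_smallest_ntt_prime(void) {
    for (int64_t p = 2 * bound_max_coeff() + 1; ; p++)
        if ((p - 1) % 512 == 0 && bound_is_prime(p)) return p;
}
```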

NTTs for MatrixVectorMul
In this section, we describe how we compute NTTs for the MatrixVectorMul in Saber. Our main contribution is the use of multi-moduli NTTs enabling flexible time-memory trade-offs that have not previously been used for implementing Saber. For the unmasked implementation on the Cortex-M4, we show how to mitigate the memory expansion from 16-bit to 32-bit coefficients with NTTs at a relatively low cost. Our analysis shows that any algorithm not exploiting the negacyclic property requires the same amount of memory. For the masked implementation on the Cortex-M4, we propose the use of unequal-size NTTs for handling the big × big polynomial multiplications. On the Cortex-M3, we propose two approaches. Our 32-bit NTT approach applies non-constant-time computation to the public matrix for speed and constant-time computation whenever secret data is involved. Our 16-bit NTT approach is a straight adaptation of the AVX2 implementation in [CHK+21].
We implement all the known speed optimizations in the literature for the Cortex-M4 and Cortex-M3. On the Cortex-M4, our 32-bit butterfly is from [ACC+21] and our 16-bit butterfly is from [ABCG20]. We additionally find a slightly faster computation for the cyclic version used in the iNTT; the faster computation will be added in the eprint version. On the Cortex-M3, our 16-bit and 32-bit butterflies are from [GKS21]. For solving the CRT, we follow the AVX2 implementation in [CHK+21]. We also implement all the known stack optimizations, including just-in-time generation of the public matrix and small storage for the secret from [MKV20].
In Table 2, we give a summary of the implemented NTTs. On the Cortex-M4, we implement incomplete NTTs/iNTTs with 6 layers of CT butterflies for all implementations. On the Cortex-M3, we implement both a 32-bit approach and a 16-bit approach to find the optimal one. For the 32-bit approach, we implement a complete NTT with 8 layers of CT butterflies and a complete iNTT with 8 layers of GS butterflies. For the 16-bit approach, we implement incomplete NTTs/iNTTs with 6 layers of CT butterflies.

Reducing stack usage for MatrixVectorMul
The state-of-the-art Saber implementations [CHK+21] using NTTs have thus far not been thoroughly optimized for minimal stack consumption. The authors exclusively optimized for speed and do not report any stack usage. Later, Van Beirendonck and Hwang refactored the implementation to reduce stack usage without degrading speed. In this section, we give a more thorough analysis of time-memory trade-offs.
The most memory-consuming operation in Saber is the MatrixVectorMul A^T s in key generation and As′ in encryption. In all implementations, we employ on-the-fly generation of A and consequently only need one polynomial of A in memory. For computing A^T s in key generation, we can compute the NTT of s on the fly but accumulate the entire result in the NTT domain with l accumulators. This is because the first component of the result only depends on the first column of A and the first component of s. For computing As′ during encryption, we compute the entire NTT of s′ with l polynomial buffers but hold only one buffer for accumulation. This is because a component of the result is an inner product of a row of A and s′, and the components are computed in order. In summary, for computing A^T s, the most memory-consuming part is the accumulation in the NTT domain, and for computing As′, the most memory-consuming part is transforming s′ into the NTT domain. In the most speed-optimized and the most stack-optimized implementations, there is no downside to this, but the intermediate choices result in different time-memory trade-offs, as shown below.
We now show that there are four ways of computing the product, which we name strategies A, B, C, and D. They are distinguished by whether the NTTs of s′ are cached and whether the accumulation happens in the NTT domain. All four strategies apply to A^T s and As′. A is the fastest, and D consumes the least amount of memory. B and C run in comparable cycles but result in different degrees of memory trade-off. For reducing the memory usage of A^T s, B is much better than C, since B effectively reduces the size of the accumulators. On the other hand, for reducing the memory usage of As′, C is much better than B, since C avoids caching the entire NTT(s′). On the Cortex-M4, A corresponds to the implementation in [CHK+21]; we additionally implement D for unmasked Saber, and A, C, and D for masked Saber. On the Cortex-M3, we implement A for the 32-bit NTT and strategies A, C, and D for the 16-bit NTT.

Implementation on M4
For simplicity of the discussion, throughout this section, we assume ω is a principal 128-th root of unity, so x^256 + 1 = x^256 − ω^64. We illustrate our strategies only for the MatrixVectorMul As′ in encryption; the ideas apply analogously to A^T s in key generation. For the concrete evaluation of the stack usage, we use l to refer to the matrix dimension (l = 2 for LightSaber, l = 3 for Saber, and l = 4 for FireSaber). For our masked implementation, we refer to SABER_SHARES as the number of shares. Since our masked NTT multiplication is a generic multiplier for Saber, our code works for any masking order. However, the other parts of masked Saber from [VBDK+20] only support first-order masking, and hence SABER_SHARES is always 2 in our experiments.
We exclusively use the Cooley-Tukey FFT to implement both the NTT and the iNTT on the Cortex-M4. We recall the corresponding butterfly operations for 16-bit NTTs and 32-bit NTTs known from the literature [ABCG20, ACC+21] in the following.

32-bit CT butterflies.
A straightforward implementation of 32-bit CT butterflies uses smull and smlal, both giving 64-bit intermediate results, for a · (bR mod± Q) and for the multiplication by Q with accumulation. The 32-bit CT butterfly then proceeds with an add-sub of (a_0, b·a_1) [ACC+21]. Although the 32-bit butterfly from [GKS21] gives the same functionality, we implement the 32-bit butterfly from [ACC+21] for a smaller code size.

16-bit CT butterflies.
We implement CT butterflies with s{mul,mla}{b,t}{b,t}. Furthermore, we can use sadd16 and ssub16 to perform add-sub pairs in parallel [ABCG20].
The workflow is outlined in Algorithm 4. We declare 16-bit arrays in the order buff1_16, buff2_16, buff3_16 and 32-bit pointers *buff1_32 = (uint32_t *)buff1_16, *buff2_32 = (uint32_t *)buff2_16, so we can access the memory as 32-bit arrays at some points. First, we compute NTT^{(p_0 p_1)}(a(x)) and store the result in the 32-bit array buff1_32. We then compute buff1_32 mod p_1 and put it in the 16-bit array buff3_16. For computing buff1_32 mod p_0, we see that the content of buff1_32 is no longer needed after reducing mod p_0, so we compute buff1_32 mod p_0 and put it in the 16-bit array buff1_16. This is doable if we compute mod p_0 from the beginning of the array. We proceed by computing NTT^{(p_1)}(b(x)) in the 16-bit array buff2_16 followed by base_mul_{64:4:ω_{p_1}:128} outputting to buff3_16, and computing NTT^{(p_0)}(b(x)) in the 16-bit array buff2_16 followed by base_mul_{64:4:ω_{p_0}:128} outputting to buff2_16. Next, we compute the explicit CRT, giving 32-bit coefficients as in the NTT domain with coefficient ring Z_{p_0 p_1}, and put the result in the 32-bit array buff1_32. Finally, we compute NTT^{−1}_{(p_0 p_1)} and reduce the coefficient ring to Z_q.

Memory layout. For implementing a stack-optimized MatrixVectorMul in the encapsulation of unmasked Saber, we employ a variant of Strategy D: we declare the arrays uint16_t buff1_16[256], buff2_16[256], buff3_16[256], acc_16[256], multiply an element of A by an element of s′ with the above strategy, accumulate the result into acc_16, and finally derive an element of b′. In total, only 1536 bytes are needed if the accumulator is excluded.
Comparison with the previous stack-optimized implementation. We compare the memory usage of polynomial multiplication to the currently most stack-optimized implementation: 4 levels of memory-efficient Karatsuba [MKV20]. Ignoring the extra O(log n) memory overhead of Karatsuba, we focus on the buffers for the multiplicands and the result. For the Karatsuba approach, one needs 512 bytes for the accumulator, 512 bytes for holding a component of A, and 1022 bytes for the degree-510 result, almost the same as the NTT approach with a composite modulus. Essentially, any algorithm not exploiting the negacyclic property requires this amount of memory. We only find the work of [PC20] giving a non-NTT-based approach exploiting the negacyclic property, but the authors reported that they were not able to achieve a smaller footprint than the Karatsuba of [MKV20].

Algorithm 4: 16-bit (big, small) polynomial multiplication(s) using 1536 bytes of memory.
Declare arrays uint16_t buff1_16[256], buff2_16[256], buff3_16[256].
Declare pointers uint32_t *buff1_32 = (uint32_t *)buff1_16, *buff2_32 = (uint32_t *)buff2_16.

Masked MatrixVectorMul for Saber
A masked implementation of Saber decapsulation using Toom-Cook multiplication is given in [VBDK+20]. We improve this implementation by replacing MatrixVectorMul and InnerProd with NTT-based multiplications. As the secret polynomials s and s′ are masked arithmetically modulo q, the multiplications are no longer big × small, but rather big × big, i.e., all input polynomials are in Z_q[x]/(x^256 + 1). Therefore, the coefficients of the product can be larger than 32 bits. This implies that switching to an NTT-friendly 25-bit modulus and performing 32-bit NTTs no longer produces correct results. Instead, we propose combining a 32-bit NTT with a 16-bit NTT to compute the up-to-48-bit values and then reduce each coefficient to Z_q. We compute the 32-bit NTT and the 16-bit NTT by choosing p_0 = 44683393 = 349089 · 128 + 1 and p_1 = 769 = 6 · 128 + 1 as moduli. Their product p_0 p_1 = 44683393 · 769 = 34361529217 > 34359738368 = 2 · (q/2)^2 · 256 · 4 shows that after applying the CRT, we obtain the result as in Z.
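The modulus choice can be verified mechanically; this hedged check confirms the product of the primes, the worst-case coefficient bound 2 · (q/2)^2 · 256 · l with l = 4, and that both primes support length-128 incomplete NTTs (128 | p − 1).

```c
#include <stdint.h>

uint64_t mask_prod(void)  { return (uint64_t)44683393 * 769; }

/* 2 * (q/2)^2 * n * l with q/2 = 4096, n = 256, l = 4. */
uint64_t mask_bound(void) { return 2ULL * 4096 * 4096 * 256 * 4; }

int mask_ntt_friendly(void) {
    return (44683393 - 1) % 128 == 0 && (769 - 1) % 128 == 0;
}
```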
For computing a(x)b(x) in Z_q[x]/(x^256 + 1), we compute a(x)b(x) in Z_{p_0}[x]/(x^256 + 1) with the 32-bit NTT and in Z_{p_1}[x]/(x^256 + 1) with the 16-bit NTT. Then, we apply the CRT to obtain the result in Z_{p_0 p_1}[x]/(x^256 + 1), which coincides with the result in Z[x]/(x^256 + 1). Finally, we reduce the coefficient ring to Z_q.
Memory layout for speed-optimized implementations. For implementing a speed-optimized MatrixVectorMul, we employ a shared variant of Strategy A and declare the arrays s_NTT_{32, 16}, acc_{32, 16}, and buff_{32, 16} used below.
For each share of s′, we compute its 32-bit NTT and 16-bit NTT and store them in s_NTT_{32, 16}. For computing an element of the shared b′, we repeat the following l times: compute the 32-bit NTT and 16-bit NTT of an element of A; multiply them by the corresponding element of each share of s′ using base_mul_{64:4:ω_{p_0}:128} and base_mul_{64:4:ω_{p_1}:128}; accumulate the results into the accumulators acc_{32, 16}; compute the 32-bit iNTT and 16-bit iNTT for each share; and finally, solve the CRT and reduce to Z_q for each share.

Memory layout for stack-optimized implementations. For computing the shares of a polynomial product, we repeat the following l times. We first expand an element of A and store it in buff_16. Then we compute the 32-bit NTT and the in-place 16-bit NTT of the element, storing the results in buff_{32, 16}. Next, we repeat the following SABER_SHARES times: clear the arrays s_NTT_{32, 16}; compute the 32-bit NTT and 16-bit NTT of a share of s′ and store them in s_NTT_{32, 16}; compute in-place base_mul_{64:4:ω_{p_0}:128} and base_mul_{64:4:ω_{p_1}:128} as well as the in-place 32-bit iNTT and 16-bit iNTT; solve the CRT; and finally, accumulate the result into the corresponding share of acc_16. In total, 3072 bytes are needed if the accumulators are excluded.

Comparison with masked Toom-Cook. We first compare the stack usage. In [VBDK+20], the polynomial multiplication is implemented as a Toom-4 followed by 2 levels of Karatsuba, so one polynomial must be held in fully evaluated form with 2-byte coefficients. With carefully optimized accumulation, 3076 bytes are used; in total, 3588 bytes are needed because of the additional buffer for an element of A. For our stack-optimized implementation, we only need 3072 bytes. Next, we compare the number of NTTs computed in the speed-optimized implementation. We compute 9 32-bit NTTs and 9 16-bit NTTs for A, 6 32-bit NTTs and 6 16-bit NTTs for the shared secret, and 6 32-bit iNTTs and 6 16-bit iNTTs for the shared results. In summary, we need 15 32-bit NTTs, 15 16-bit NTTs, 6 32-bit iNTTs, and 6 16-bit iNTTs. Given that one 16-bit NTT takes 0.79× the time of one 32-bit NTT and one 16-bit iNTT takes 0.82× the time of one 32-bit iNTT, we essentially need the equivalent of 26.85 32-bit NTTs and 10.92 32-bit iNTTs. Compared to [CHK+21], we only need about 2.24× as many 32-bit NTTs and 3.64× as many 32-bit iNTTs, which is clearly faster than the shared variant of Toom-Cook.
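The operation-count arithmetic can be checked as follows; the baseline of 12 NTTs and 3 iNTTs for unmasked [CHK+21] is inferred from the stated 2.24× and 3.64× ratios, not taken verbatim from the source.

```c
/* 15 NTTs and 6 iNTTs at each width; a 16-bit NTT costs 0.79x a 32-bit
 * NTT, a 16-bit iNTT costs 0.82x a 32-bit iNTT. */
double count_ntt_equiv(void)  { return 15 + 15 * 0.79; }  /* 26.85 */
double count_intt_equiv(void) { return  6 +  6 * 0.82; }  /* 10.92 */

/* Ratios vs the inferred unmasked baseline (12 NTTs, 3 iNTTs). */
double count_ntt_ratio(void)  { return count_ntt_equiv() / 12.0; }
double count_intt_ratio(void) { return count_intt_equiv() / 3.0; }
```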

Implementation on M3
Due to the more limited instruction set and the early-terminating long multiplications of the Cortex-M3, the 32-bit butterflies from the previous section can only be used with some restrictions. In general, there are two approaches to still benefit from NTTs on the Cortex-M3: one can either implement 32-bit NTTs but avoid the early-terminating multiplication instructions for secret inputs, or one exclusively uses 16-bit NTTs and computes the CRT of the results. The former approach resembles the Cortex-M4 approach from [CHK+21] and the previous section, while the latter is similar to the AVX2 implementation from [CHK+21]. We implement both approaches and compare their performance. We start by describing the butterfly implementations. For the 32-bit approach, we use CT for the NTT and GS for the iNTT, while for the 16-bit approach we use CT for both.

32-bit CT butterflies.
The 32-bit CT butterflies with smull and smlal are functionally correct on the Cortex-M3. However, these instructions terminate early depending on their operands and can therefore only be used when computing on public data. We denote the 5-instruction 32-bit butterflies as NTT_leak on the Cortex-M3. For computing the NTTs of the secret values s and s′ on the Cortex-M3, we implement smull_const and smlal_const with radix-2^16 schoolbook multiplication as suggested in [GKS21].
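To illustrate the idea behind smull_const and smlal_const, the following sketch models a 32×32 → 64-bit product assembled from 16×16 → 32-bit half-products only; on the Cortex-M3, mul and mla run in constant time, while smull/smlal terminate early. The function name and the unsigned treatment are ours for illustration; the actual routines are signed assembly code.

```python
# Hypothetical model of smull_const: a 32x32 -> 64-bit multiply built
# from 16x16 -> 32-bit half-products (radix-2^16 schoolbook), avoiding
# the early-terminating smull/smlal on secret operands.
MASK16 = 0xFFFF

def smull_const_model(a: int, b: int) -> int:
    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16
    lo  = a_lo * b_lo                 # one 16x16 -> 32 product (mul)
    mid = a_lo * b_hi + a_hi * b_lo   # two more, accumulated (mla)
    hi  = a_hi * b_hi
    return lo + (mid << 16) + (hi << 32)
```

Every partial product involves only 16-bit operands, so no operand-dependent early termination can occur.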

32-bit GS butterflies.
As for the CT butterflies, we also use smull_const and smlal_const for the 32-bit GS butterflies. After loading the coefficients as 32-bit values for the add-sub, we split the result of a_0 − a_1 into halves for the Montgomery multiplication.

16-bit CT butterflies.
A straightforward implementation of 16-bit CT butterflies uses mul and mla, with sxth for extracting the lower 16 bits [GKS21].
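A possible Python model of such a butterfly, using a signed Montgomery multiplication in the style of [GKS21], is sketched below; the modulus 3329 is one of the 16-bit primes discussed later in the text, and the helper names are ours.

```python
# Sketch of a 16-bit CT butterfly with signed Montgomery multiplication;
# sxth() models the ARM instruction (sign-extend the low halfword).
Q = 3329                          # one of the 16-bit NTT-friendly primes
QINV = pow(Q, -1, 1 << 16)        # Q^{-1} mod 2^16

def sxth(x: int) -> int:
    return ((x & 0xFFFF) ^ 0x8000) - 0x8000

def montgomery_mul(a: int, b: int) -> int:
    t = a * b                     # mul
    m = sxth(t * QINV)            # extract the low 16 bits, signed
    return (t - m * Q) >> 16      # exact: t - m*Q is 0 mod 2^16

def ct_butterfly(a0: int, a1: int, zeta_mont: int):
    """zeta_mont = zeta * 2^16 mod Q cancels the Montgomery factor."""
    t = montgomery_mul(a1, zeta_mont)
    return a0 + t, a0 - t
```

The outputs are kept unreduced modulo Q, as lazy reduction in the assembly would allow.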

32-bit NTT for MatrixVectorMul
We implement strategy A for MatrixVectorMul using 32-bit NTTs on the Cortex-M3. An important observation is that A is public, so we can employ NTT_leak on A. This greatly improves the performance since, among the l^2 + 2l NTTs/iNTTs, l^2 of them are computations on A. On the other hand, the NTTs of the secret and base_mul can only be computed with smull_const and smlal_const. We use the constant-time 32-bit CT and GS butterflies for the NTT and iNTT on secret data, respectively. Using smull_const and smlal_const leads to much higher register pressure during the entire multiplication. Because of this, we do not benefit from incomplete NTTs, as the 2 × 2 base multiplication already exhausts the available registers. Therefore, we compute complete NTTs.

16-bit NTTs for MatrixVectorMul
We implement strategies A, C, and D with the 16-bit NTT approach for MatrixVectorMul on the Cortex-M3. Our results show that the 16-bit approach is faster than the 32-bit approach. For strategy A, this corresponds to the AVX2 implementation from [CHK + 21]. We also carry over the stack optimizations from the Cortex-M4 and implement strategies C and D.

A Note on combining 32-bit and 16-bit
There is an interesting observation when comparing the cycle counts of MatrixVectorMul: one 8-layer NTT_leak costs only about 1.15× as much as two 6-layer 16-bit NTTs. This suggests that a 6-layer NTT_leak might be faster still: one first computes A with a 6-layer NTT_leak and then transforms the result into two 16-bit NTTs via i → (i mod p_0, i mod p_1). However, our experiments show that the performance gain from NTT_leak is canceled out by the cost of the map i → (i mod p_0, i mod p_1). Therefore, we do not use this trick in our implementation.
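For reference, the split i → (i mod p_0, i mod p_1) and its CRT inverse can be sketched as follows, using 3329 and 7681 (the smallest suitable 16-bit prime pair named later in the text); the helper names are ours.

```python
# Sketch of the residue split i -> (i mod p0, i mod p1) and its CRT
# inverse; p0 = 3329 and p1 = 7681 are the 16-bit primes named
# elsewhere in the text.
P0, P1 = 3329, 7681
P0_INV = pow(P0, -1, P1)          # P0^{-1} mod P1

def split(x: int):
    return x % P0, x % P1

def crt(r0: int, r1: int) -> int:
    # x = r0 + P0 * ((r1 - r0) * P0^{-1} mod P1), unique modulo P0 * P1
    return r0 + P0 * (((r1 - r0) * P0_INV) % P1)
```

Any value below P0 · P1 round-trips through the split, which is what makes the two 16-bit NTTs equivalent to one wide NTT.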

Results
This section presents our results on the Cortex-M3 and Cortex-M4. We first describe our target platforms and setup, and then present the results in Section 5.1. Section 5.2 evaluates the side-channel resistance of our masked implementation.
Cortex-M4 setup. We target the STM32F407-DISCOVERY board featuring an STM32F407VG Cortex-M4 microcontroller with 196 kB of SRAM and 1 MB of flash. Our benchmarking setup is based on pqm4 [KRSS]; we clock the core at 24 MHz with no flash wait states.
Cortex-M3 setup. Our Cortex-M3 target platform is the Nucleo-F207ZG board containing an STM32F207ZG microcontroller with 128 kB of SRAM and 1 MB of flash. Our benchmarking setup is based on pqm3. We clock the core at 30 MHz to avoid flash wait states.

Keccak and Randomness.
For both implementations, we use the ARMv7-M assembly implementation of Keccak from the XKCP, which runs on both the Cortex-M3 and the Cortex-M4. This implementation is also contained in both pqm3 and pqm4. For the randomness required in key generation and encapsulation, we use the hardware RNG.
All code is compiled with arm-none-eabi-gcc version 10.2.0 with -O3. We report results for a single polynomial multiplication in Table 3. Each of the first three columns realizes a polynomial multiplication as 2 · NTT (or NTT + NTT_leak) + base_mul + NTT^{-1}, followed by a CRT where applicable (i.e., where the CRT entry is not marked "-"); the stack usage reported in these columns is for the full polynomial multiplication. The NTT of the column "32-bit + 16-bit" contains a layer of sbfx to reduce elements to Z_q. The last two columns together realize a polynomial multiplication as one 32-bit NTT, two 16-bit NTTs, two 16-bit base_muls, one CRT producing a 32-bit polynomial, and finally one 32-bit NTT^{-1}; their cycle counts must be summed. One of the 16-bit base_muls is preceded by a modular reduction to save load and store instructions.

Performance
We report the results of our implementations of unmasked Saber in Table 4. For the ARM Cortex-M3, our speed-optimized NTT implementation of (unmasked) Saber requires only 65.0%-70.7% of the time and 45.0%-51.2% of the stack space compared to the Toom-Cook implementation available in pqm3. Our stack-optimized implementation is still 5.6%-13.0% faster while requiring 70.3%-79.9% less stack space. On the Cortex-M4, our stack-optimized implementation requires about the same or slightly less stack while achieving a vast speed-up compared to the stack-optimized Saber from [MKV20]. The results for masked decapsulation of Saber are shown in Table 5. We also report the overheads in cycles and stack usage in Table 6. Our speed-optimized approach outperforms Toom-Cook by 15.4%. Our stack-optimized approach uses 72.3% of the stack of Toom-Cook and is only slightly slower. Trading speed for memory, we implement strategy C, outperforming Toom-Cook in both speed and memory.

Leakage Evaluation of Masked MatrixVectorMul in Saber
We adopt the Test Vector Leakage Assessment (TVLA) methodology to perform leakage detection. We use the CW1173 ChipWhisperer-Lite [Newb] to collect power-consumption traces at a sampling rate of 59.04 MS/s. The target board is the CW308 UFO [Newc] with the ChipWhisperer target CW308_STM32F4 (ST Micro STM32F405) [Newa], on which we run our implementations at a frequency of 7.38 MHz. We focus on the key decapsulation and capture three sets of power traces corresponding to the test vectors in Table 7 [ISO16] (Set 2: fixed secret key, randomly-chosen ciphertexts; Set 3: randomly-chosen secret keys, fixed ciphertext). Then, we compute Welch's t-test to identify the differentiating features between Set 1 and Set 2, and between Set 1 and Set 3. The maximum number of samples on the CW1173 ChipWhisperer-Lite is 24573 [Newb]; thus, we cannot capture the whole power trace of a full Saber decapsulation. In our experiments, we only capture traces of the power consumption toward the beginning of the key decapsulation, which computes an inner product between the ciphertext and the secret key using NTT-based polynomial multiplication. There are four steps: the NTT of the ciphertext, the NTT of the secret key, the base multiplication, and the iNTT.
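The per-point statistic behind TVLA is Welch's t-test; a minimal version for a single sample point might look as follows (the trace values below are made-up numbers, not our measurements).

```python
# Welch's t-statistic for one sample point across two trace sets; in
# TVLA, |t| > 4.5 at any point in time flags potential first-order
# leakage. Real traces have thousands of points; this runs per point.
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    na, nb = len(sample_a), len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(
        variance(sample_a) / na + variance(sample_b) / nb)

# one sample point taken from each of two (hypothetical) trace sets
set1_point = [0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.16, 0.14]
set2_point = [0.13, 0.17, 0.12, 0.19, 0.15, 0.14, 0.16, 0.12]
t = welch_t(set1_point, set2_point)   # well below the 4.5 threshold here
```

In practice the statistic is evaluated at every sample index of the aligned traces and plotted against the ±4.5 threshold, as in our figures.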
In the first experiment, we perform the TVLA on the power traces of Set 1 and Set 2, which correspond to a fixed versus randomly-chosen ciphertexts under a fixed secret key. In the second step, the NTT of the secret key, there is no leakage, which is expected since the secret key is fixed in this experiment. The first and third steps, the NTT of the ciphertext and the base multiplication between the NTT results of the ciphertext and the secret key, show leakage, which is expected since the ciphertext is public information. After the base multiplication, the inverse NTT shows no leakage in the protected version. By contrast, there is leakage in the unprotected version. Figure 3a and Figure 3b show the t-tests of unprotected Saber and masked Saber on the power traces of Set 1 and Set 2. Each figure can be separated into two parts by the black lines: 1. the base multiplication between the NTT of the ciphertext and the NTT of the secret key; 2. the inverse NTT. We can see that the t-statistic of masked Saber stays inside the ±4.5 interval [WO19] (red lines) for all points in time during the NTT^{-1}, which implies that the protected implementation is secure against first-order attacks.
In addition, the t-statistic of the first part in Figure 3b lies outside the ±4.5 interval, since one of the multiplicands of the base multiplication, the ciphertext, is a public value.
In the second experiment, we perform the TVLA on the power traces of Set 1 and Set 3, which correspond to a fixed versus randomly-chosen secret keys under a fixed ciphertext. In the second step, the NTT of the secret key shows no leakage in the protected version. By contrast, there is leakage in the unprotected version. Figure 4a and Figure 4b show the t-tests of unprotected Saber and masked Saber on the power traces of Set 1 and Set 3. Each figure can be separated into two parts by the black lines: 1. the NTT of the ciphertext; 2. the NTT of the secret key. We can see that the t-statistic of masked Saber stays inside the ±4.5 interval [WO19] (red lines) for all points in time during the NTT, which implies that the protected implementation is secure against first-order attacks.
Our masked Saber implementation as described in Section 4.2.2 only differs from [VBDK + 20] in MatrixVectorMul and InnerProd. Hence, the masked Keccak implementation remains unchanged. To verify that this implementation is indeed secure, we perform another set of experiments targeting the beginning of the SHA3-512 function, i.e., the absorb step of the Keccak sponge construction. We then perform the TVLA on the power traces of Set 1 and Set 2, which correspond to randomly-chosen versus fixed ciphertexts under a fixed secret key. In masked Saber, turning the masks on or off activates or deactivates the countermeasure. Figure 5a and Figure 5b show the t-tests of the Keccak implementation in masked Saber on the power traces of Set 1 and Set 2 with masks off and with masks on, respectively. We can see that the t-statistic of masked Saber with masks on stays inside the ±4.5 interval [WO19] (red lines) for all points, which means that the masked Saber implementation is secure against first-order attacks when the masks are on.

All our implementations are open source and available at https://github.com/multi-moduli-ntt-saber/multi-moduli-ntt-saber.

Related work. There is a line of work optimizing Saber for the Cortex-M4 [KRS19, MKV20, CHK + 21] using Karatsuba, Toom-Cook, and lately also NTTs. A masked Saber is presented by Van Beirendonck et al. in [VBDK + 20]. Other NISTPQC third-round candidates have been implemented for the Cortex-M3 and M4. The ones most relevant to us are the constant-time NTTs from Greconici et al. [GKS21] and the stack optimizations by Botros et al. [BKS19]. Composite-modulus NTTs were earlier studied in the context of side-channel protections for lattice-based schemes by Heinz and Pöppelmann [HP21].
root of unity. By setting ω = ζ^2, the negacyclic NTTs of Kyber and Dilithium, which are exactly the upper halves of standard NTTs, are special cases of NTT_{n:ζ:ω} and NTT^{-1}_{n:ζ:ω}. But notice that our definitions are more generic than in [CF94] because we simply aim to compute negacyclic convolutions. Additionally, by setting ζ = 1, one obtains the cyclic versions NTT_{n:1:ω} and NTT^{-1}_{n:1:ω}.
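These definitions can be exercised on toy parameters; the sketch below (q = 17, n = 4, ζ = 2, so ζ has order 2n and ω = ζ² has order n, all chosen purely for illustration) checks that NTT_{n:ζ:ω} diagonalizes negacyclic convolution.

```python
# Toy model of NTT_{n:zeta:omega} and its inverse: a twist by powers of
# zeta followed by a cyclic DFT with omega = zeta^2. Parameters are
# illustrative only (q = 17, n = 4, zeta = 2 of order 2n = 8).
Q, N, ZETA = 17, 4, 2
OMEGA = ZETA * ZETA % Q

def ntt(a):
    return [sum(a[i] * pow(ZETA, i, Q) * pow(OMEGA, i * j, Q)
                for i in range(N)) % Q for j in range(N)]

def intt(h):
    n_inv = pow(N, -1, Q)
    return [n_inv * pow(ZETA, -i, Q)
            * sum(h[j] * pow(OMEGA, -i * j, Q) for j in range(N)) % Q
            for i in range(N)]

def negacyclic_schoolbook(a, b):
    c = [0] * N                        # product in Z_q[x] / (x^N + 1)
    for i in range(N):
        for j in range(N):
            s = 1 if i + j < N else -1  # x^N = -1 wraps with a sign
            c[(i + j) % N] = (c[(i + j) % N] + s * a[i] * b[j]) % Q
    return c

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
prod = intt([x * y % Q for x, y in zip(ntt(a), ntt(b))])
```

Pointwise multiplication in the transformed domain followed by the inverse recovers the negacyclic product, mirroring the definitions above.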

A. we cache NTT(s) and accumulate values in the NTT domain;
B. we cache NTT(s) and accumulate values in the normal domain;
C. we re-compute NTT(s) and accumulate values in the NTT domain;
D. we re-compute NTT(s) and accumulate values in the normal domain.

Figure 3 :
Figure 3: T-test results on traces of Set 1 and 2

Figure 4 :
Figure 4: T-test results on traces of Set 1 and 3

Figure 5 :
Figure 5: T-test of Keccak implementation in masked Saber on traces Set 1 and 2

Table 2 :
Summary of NTT approaches.

This section is organized as follows: First, we analyze strategies for reducing the stack usage of MatrixVectorMul in Section 4.1. Next, we go through our implementations on the Cortex-M4 in Section 4.2: our stack-optimized implementation of unmasked Saber in Section 4.2.1, and speed-optimized and stack-optimized implementations of masked Saber in Section 4.2.2. Finally, we present our implementation on the Cortex-M3 in Section 4.3, covering the 32-bit NTT in Section 4.3.1 and the 16-bit NTT in Section 4.3.2.
Stack-optimized implementations. For implementing a stack-optimized MatrixVectorMul, we employ a shared variant of strategy D and declare arrays
The stack usage of the last two columns is the number of bytes occupied by the functions, but the actual stack usage is 1536 bytes, since the arrays overlap.

On a joint implementation with Kyber, with the NTT optimized for stack and program size.
Due to the flexibility in choosing moduli, one can share the 16-bit NTT implementations between Kyber and Saber. However, we do not recommend this: in a joint software implementation, neither Kyber nor Saber will be optimal, for the following reasons: (1) the Kyber NTT has 7 layers, while the optimal NTT for Saber has 6 layers; (2) Saber requires two 16-bit primes whose product must be larger than 25165824, and the smallest suitable primes are 3329 and 7681. The first reason implies that MatrixVectorMul for Saber is suboptimal, and the second reason implies that more reductions are required for the NTT of Kyber, since 7681 > 3329.
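The numeric claims in this note are easy to check; the condition encoded in ntt_friendly below (a k-layer negacyclic size-256 NTT needs 2^(k+1) to divide p − 1, so that the required roots of unity exist) is our reading of the standard root-of-unity requirement, not a statement from the text.

```python
# Check the prime-pair constraints from this note: both 3329 and 7681
# are prime, their product exceeds 25165824, and both support 6-layer
# NTTs (primitive 128th roots of unity exist since 2^7 divides p - 1).
def is_prime(n: int) -> bool:
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def ntt_friendly(p: int, layers: int = 6) -> bool:
    return (p - 1) % (1 << (layers + 1)) == 0

BOUND = 25165824
P0, P1 = 3329, 7681
ok = is_prime(P0) and is_prime(P1) and P0 * P1 > BOUND
```

Here 3329 · 7681 = 25570049, just above the required bound, which is why this is the smallest suitable pair.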

Table 7 :
Test Vectors of Saber for captured power traces