Compact Dilithium Implementations on Cortex-M3 and Cortex-M4

We present implementations of the lattice-based digital signature scheme Dilithium for ARM Cortex-M3 and ARM Cortex-M4. Dilithium is one of the three signature finalists of the NIST post-quantum cryptography competition. As our Cortex-M4 target, we use the popular STM32F407-DISCOVERY development board. Compared to the previous speed records on the Cortex-M4 by Ravi, Gupta, Chattopadhyay, and Bhasin, we speed up the key operations NTT and NTT$^{-1}$ by 20%, which together with other optimizations results in speedups of 7%, 15%, and 9% for Dilithium3 key generation, signing, and verification, respectively. We also present the first constant-time Dilithium implementation on the Cortex-M3 and use the Arduino Due for benchmarks. For Dilithium3, we achieve on average 2 562 kilocycles for key generation, 10 667 kilocycles for signing, and 2 321 kilocycles for verification. Additionally, we present stack consumption optimizations applying to both our Cortex-M3 and Cortex-M4 implementations. Due to the iterative nature of the Dilithium signing algorithm, there is no optimal way to achieve the best speed and lowest stack consumption at the same time. We present three different strategies for the signing procedure which allow trading more stack and flash memory for faster speed or vice versa. Our implementation of Dilithium3 with the smallest memory footprint uses less than 12 kB. As an additional output of this work, we present the first Cortex-M3 implementations of the key-encapsulation schemes NewHope and Kyber.


Introduction
In 2016, NIST called for proposals for new post-quantum schemes [NIS16] which are meant to replace the existing standards for key establishment (SP 800-56A [BCR+18] and SP 800-56B [BCR+19]) and digital signatures (FIPS 186-4 [Nat13]). While the existing standards are based on the hardness of integer factorization and computing discrete logarithms and are, therefore, broken by Shor's algorithm [Sho94], the new ones should resist adversaries with access to a large-scale quantum computer. After receiving 69 submissions in 2017, NIST narrowed down to 26 schemes advancing to round two in 2019. In July 2020, NIST announced their selection of seven finalists which are to be evaluated in a third round. Out of those finalists, NIST expressed their intention to standardize a subset at the end of round three. Additionally, NIST announced eight alternative schemes which may still be standardized at a later point. The seven finalists include the four key establishment schemes Classic McEliece, Kyber, NTRU, and Saber as well as the three signature schemes Dilithium, Falcon, and Rainbow.
One major family of post-quantum cryptographic schemes is based on hard lattice problems. Out of the seven finalists, five are lattice-based. Together with Falcon [PFH+19], Dilithium [LDK+19] is one of the two remaining lattice-based signature schemes. In round two, there was an additional lattice-based signature scheme called qTesla [BAA+19]. However, qTesla was not selected to advance to the third round.
Dilithium and qTesla are conceptually very similar: they are both Fiat-Shamir-with-aborts schemes [Lyu09] based on {R,M}LWE and {R,M}SIS. However, Dilithium has significantly smaller keys, smaller signatures, and better performance. For example, qTESLA-p-I public keys are 14 880 bytes and signatures are 2 592 bytes, while Dilithium2 has 1 184 byte public keys and 2 044 byte signatures while providing the same level of (claimed) security. Note that qTesla initially also proposed "heuristic" parameter sets which achieved sizes and performance closer to Dilithium, but the qTesla team withdrew those parameter sets because Lyubashevsky and Schwabe presented a complete break allowing universal forgeries¹. Falcon, on the other hand, is a "hash-and-sign" signature scheme based on NTRU lattices and, hence, has different characteristics while being competitive with Dilithium in terms of message sizes and computational performance.
Together with the finalist key-encapsulation mechanism Kyber [ABD+19], Dilithium is part of the Cryptographic Suite for Algebraic Lattices (CRYSTALS). Both Dilithium and Kyber use structured lattices to allow fast arithmetic and compact key, signature, and ciphertext sizes. Both make use of the polynomial ring $\mathbb{Z}_q[X]/(X^{256} + 1)$, which enables efficient polynomial multiplication using the number theoretic transform (NTT). However, Dilithium is using a 23-bit prime modulus, while Kyber is using a 12-bit prime modulus, which means that their implementations of polynomial arithmetic differ significantly.
While there is a vast literature on the implementation of lattice-based key-encapsulation schemes, the coverage of lattice-based signatures is still limited and more research is needed. We advance the field by presenting optimized implementations of Dilithium for the ARM Cortex-M3 and ARM Cortex-M4. In this work, aside from optimizing for speed, we also optimize for stack usage.
The Cortex-M4 has been declared the main microcontroller optimization target for the post-quantum competition by NIST and, hence, the majority of schemes in the third round have an optimized implementation for that architecture. However, its "smaller brother", the Cortex-M3, is also still widely deployed.
The Cortex-M4 provides various advanced instructions for optimizing cryptographic schemes which might be one of the reasons why it has received much attention from the cryptographic community.
However, the Cortex-M3 comes with one "feature" which does appear interesting from an implementation and also from a side-channel perspective: Different from the Cortex-M4, it does not have a constant-cycle 32-bit multiplier producing a 64-bit result, but only a variable-cycle one. Therefore, an implementation of any scheme working on large (secret) integers compiled for the Cortex-M3 is most likely going to leak information about these secret integers via timing side channels. This has been shown to pose a problem for cryptographic schemes in preceding ARM architectures [GOPT09]. This is particularly interesting for Dilithium, because of the large prime modulus q = 8380417. If existing implementations for Dilithium are simply compiled for the Cortex-M3, they are very likely to be vulnerable to timing attacks within the polynomial multiplication. In this paper, we build a safe constant-time implementation of Dilithium on the Cortex-M3. That is, the execution time of the algorithm is invariant over all the secret values in the algorithm.

For all parameter sets, the modulus is $q = 2^{23} - 2^{13} + 1 = 8380417$ and the polynomial dimension is $n = 256$. Furthermore, for all parameter sets the bound $\gamma_1$ is set to $(q - 1)/16 = 523776$ and $\gamma_2 = \gamma_1/2 = 261888$. For each parameter set, the remaining parameters and the resulting public key and signature sizes are given in Table 1. The parameters consist of the matrix dimension $(k, l)$, the sampling bound of the secret $\eta$, and the rejection thresholds $\beta$ and $\omega$. The Dilithium signature generation algorithm uses rejection sampling to find a signature that can be correctly verified and does not leak information about the secret key.
A polynomial can be transformed into the NTT domain (or frequency domain) in quasi-linear time using a fast Fourier transform (FFT) algorithm. The multiplication in the NTT domain is coefficient-wise multiplication and, hence, has linear run-time. To transform back to the regular domain (or time domain), the inverse NTT (denoted as NTT$^{-1}$ from here on) is computed, which again can be implemented in quasi-linear time. A full polynomial multiplication can be performed as $\mathrm{NTT}^{-1}(\mathrm{NTT}(a) \bullet \mathrm{NTT}(b))$. While NTT-based multiplication itself does achieve superior performance on some platforms over other multiplication methods, the advantage is even bigger when either argument is already in the NTT domain or, alternatively, the output can remain in the NTT domain.
For a polynomial $a = \sum_{i=0}^{n-1} a_i X^i$, the Dilithium NTT is defined as
$$\hat{a} = \mathrm{NTT}(a) = \sum_{i=0}^{n-1} \hat{a}_i X^i, \quad \text{with } \hat{a}_i = \sum_{j=0}^{n-1} a_j r^{(2i+1)j},$$
where $r$ is a $2n$-th primitive root of unity modulo $q$. Dilithium uses the 512-th primitive root of unity $r = 1753$.

Algorithm 3 Dilithium verification
Input: Public key $pk = (\rho, t_1)$

The NTT$^{-1}$ is defined as
$$a = \mathrm{NTT}^{-1}(\hat{a}) = \sum_{i=0}^{n-1} a_i X^i, \quad \text{with } a_i = n^{-1} \sum_{j=0}^{n-1} \hat{a}_j r^{-(2j+1)i}.$$
Dilithium implementations usually use the Cooley-Tukey (CT) FFT algorithm [CT65] in the forward NTT and the Gentleman-Sande (GS) FFT algorithm [GS66] in the NTT$^{-1}$. These algorithms implement the NTT in quasi-linear time, and make use of $\log n$ layers of $n/2$ Cooley-Tukey or Gentleman-Sande "butterflies", which we will introduce in Section 3.
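The multiplication property can be checked numerically with a direct, quadratic-time evaluation of the transform. The following is a minimal sketch, assuming the textbook negacyclic definition $\hat{a}_i = \sum_j a_j r^{(2i+1)j}$ with coefficients in natural (not bit-reversed) order; the function names are illustrative and not taken from any Dilithium code base, and the small size $n = 8$ is used only to keep the check fast.

```python
Q = 8380417       # Dilithium modulus
ROOT_512 = 1753   # primitive 512-th root of unity modulo Q (from the text)

def root(n):
    # primitive 2n-th root of unity derived from the 512-th root (n must divide 256)
    return pow(ROOT_512, 512 // (2 * n), Q)

def ntt_direct(a):
    # quadratic-time evaluation of a at the odd powers r^(2i+1)
    n, r = len(a), root(len(a))
    return [sum(a[j] * pow(r, (2 * i + 1) * j, Q) for j in range(n)) % Q
            for i in range(n)]

def intt_direct(a_hat):
    # inverse transform: a_i = n^-1 * sum_j a_hat_j * r^(-(2j+1) i)
    n = len(a_hat)
    r_inv, n_inv = pow(root(n), -1, Q), pow(n, -1, Q)
    return [n_inv * sum(a_hat[j] * pow(r_inv, (2 * j + 1) * i, Q)
                        for j in range(n)) % Q for i in range(n)]

def negacyclic_mul(a, b):
    # schoolbook product modulo X^n + 1 and modulo Q
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % Q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % Q
    return c

def ntt_mul(a, b):
    # NTT^-1(NTT(a) pointwise-times NTT(b))
    prod = [x * y % Q for x, y in zip(ntt_direct(a), ntt_direct(b))]
    return intt_direct(prod)
```

Real implementations replace the quadratic loops by the layered butterflies and emit the transformed coefficients in bit-reversed order, which does not affect the pointwise product.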

Target Platforms: Cortex-M3 and Cortex-M4
NIST has stated that performance will play an important role in the evaluation of schemes beyond the first round². As a primary microcontroller optimization target, NIST recommends the use of the Cortex-M4 board with all options included. Consequently, previous work on microcontroller implementations of Dilithium [GKOS18, RGCB19] has primarily focused on the Cortex-M4. Particularly, it has been targeting the STM32F407 core which was popularized for post-quantum cryptography by the testing and benchmarking framework pqm4 [KRSS]. For our Cortex-M4 optimization we target the same core and board so that we can report comparable results.

Cortex-M4.
The Cortex-M4 implements the ARMv7E-M [ARM14] instruction set architecture (ISA). The core we use is the STM32F407 which provides 196 KiB of RAM (of which 128 KiB are contiguous) and 1 MiB of flash, and it runs at a maximum frequency of 168 MHz. One feature of the ARMv7E-M instruction set makes it particularly interesting for Dilithium optimizations: the large single-cycle multiplier implementing the instructions UMULL, SMULL, UMLAL, and SMLAL. Those allow computing the 64-bit product of two 32-bit arguments and, in the case of UMLAL and SMLAL, adding the result to a 64-bit accumulator. On the Cortex-M4 microarchitecture, these instructions execute in a single cycle. Those prove particularly useful for the Dilithium polynomial multiplication. Due to the large modulus (23-bit) in Dilithium, those multiplications do require computing 64-bit products.
Another feature of the ARMv7E-M ISA is the set of SIMD instructions like SMLAD or UADD16, which have been shown to achieve significant speedups for NTT-based polynomial multiplication [BKS19, ABCG20] and Toom-Cook-based polynomial multiplication [KRS19, KBMSRV18, BMKV20] on the Cortex-M4. While those provide vast speedups for schemes with small moduli (below 16 bits) like Kyber [ABD+19], Saber [DKRV19], or NewHope [PAA+19], they do not help when working with larger moduli like the one from Dilithium. Hence, other optimization strategies are needed.

Cortex-M3.
One issue arising when using UMULL, SMULL, UMLAL, and SMLAL is that while they are single-cycle (i.e., constant-time) on the Cortex-M4, they are not single-cycle on all ARM cores. Particularly, the Cortex-M3 (implementing the ARMv7-M ISA) also provides these instructions, but on this platform UMULL and SMULL take 3 to 5 cycles to execute and UMLAL and SMLAL take 4 to 7 cycles. As there is no authoritative information available on the early-termination conditions for these variable-time instructions, it appears dangerous to use these instructions in code that needs to be constant-time. The early-termination conditions have been reverse engineered by de Groot [dG15], who showed that there appear to be four properties that cause an early termination: (1) arguments being zero; (2) arguments being smaller than 16 bits; (3) top-heavy arguments (i.e., zero in the least significant 16 bits); or (4) arguments being a power of two.
Previous work by Großschädl, Oswald, Page, and Tunstall [GOPT09] evaluated early-terminating multiplication instructions on ARMv3 microcontrollers. They propose a constant-time multiplication algorithm which still uses the variable-time multiplication instructions, but avoids any shortcuts from being taken. Unfortunately, the newer ARMv7-M ISA appears to have vastly more sophisticated shortcuts and it appears unlikely that all shortcuts can be avoided at reasonable cost. In addition, the shortcuts identified by de Groot [dG15] are not actually confirmed by ARM. It is, hence, possible that not all shortcuts are known or that the shortcuts do not apply to all Cortex-M3 chips. Therefore, when writing constant-time code those instructions should be avoided, which means that only the multiplication instructions MUL and MLA can be used, and these only compute the lower 32 bits of the 64-bit product. This presents a challenge for implementing the Dilithium polynomial multiplication. This issue was not addressed by previous work on Dilithium implementations on the Cortex-M3 [MGTF19]. Instead, Migliore, Gérard, Tibouchi, and Fouque propose a modified Dilithium with the power-of-two modulus $q = 2^{32}$ to allow for cheaper masking. As a side effect of this proposed change, multiplications can be done using MUL, MLS, and MLA as those implicitly reduce modulo $2^{32}$, and, hence, constant-time implementations are more straightforward.
The Cortex-M3 platform we use is the popular Arduino Due, which comes with an Atmel SAM3X8E core, has 96 KiB of RAM, 512 KiB of flash, and runs at a maximum frequency of 84 MHz.

Improving the Performance on Cortex-M4
Our Cortex-M4 implementation is based on the Dilithium implementation by Ravi, Gupta, Chattopadhyay, and Bhasin [RGCB19], which includes the NTT and NTT$^{-1}$ assembly implementation of Güneysu, Krausz, Oder, and Speith [GKOS18].
The number theoretic transform is implemented using the divide-and-conquer approach of the Cooley-Tukey and Gentleman-Sande FFT algorithms. While these are strictly "fast Fourier-transform" (FFT) algorithms, they are used in Dilithium to compute the "number theoretic transform" (NTT). In this section it is often the case that both terms apply. Both the CT and GS algorithms split the input vector into two halves: the Cooley-Tukey butterfly breaks the input vector into its even-indexed values and its odd-indexed values; the Gentleman-Sande splits the input vector in the middle of the vector, yielding the first half and the last half of the original vector. After splitting the input vector, the final NTT is computed by applying the NTT over each smaller vector and then combining the results according to the CT or GS methods defined below.
In the Cooley-Tukey algorithm, the smaller NTT-transformed vectors $y$ (from the even indices) and $z$ (from the odd indices) are combined into the larger vector, and in the Gentleman-Sande algorithm, the smaller NTT-transformed vectors $y$ (from the low indices) and $z$ (from the high indices) are combined into the larger vector $x$. In both cases the combination multiplies by powers of $r_m$, where, for a vector size of $m$, the value $r_m$ denotes the primitive $m$-th root of unity modulo $q$.
For the purpose of illustration, we show these combination rules in the form of "butterfly diagrams" in Figure 1. In the literature, the powers of $r_m$ are known as twiddle factors, or twiddles for short. In these FFT algorithms, the NTTs of the smaller vectors can themselves again be computed using the CT and GS algorithms. Indeed, we can recursively apply these algorithms until we are left with vectors of size 1, for which the transform is the identity. Because the size of the Dilithium polynomial $n = 256$ is a power of two, the algorithm will be applied $\log_2 n = 8$ times until we reach the base case.
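The two splitting strategies can be sketched recursively as follows. This is an illustrative model rather than the paper's code: it computes a plain cyclic transform over $\mathbb{Z}_q$ with $r_m = 1753^{512/m}$ (an assumption made for brevity; Dilithium's NTT additionally uses the odd powers of $r$), and in the Gentleman-Sande variant the butterflies are applied before recursing into the two halves.

```python
Q = 8380417
ROOT_512 = 1753

def root_of_unity(m):
    # primitive m-th root of unity r_m modulo Q (m a power of two, m <= 512)
    return pow(ROOT_512, 512 // m, Q)

def dft_direct(x):
    # reference: X_i = sum_j x_j * r_m^(i*j)
    m, r = len(x), root_of_unity(len(x))
    return [sum(x[j] * pow(r, i * j, Q) for j in range(m)) % Q for i in range(m)]

def ntt_ct(x):
    # Cooley-Tukey: split into even and odd indices, transform, then combine
    m = len(x)
    if m == 1:
        return list(x)  # base case: transform of size 1 is the identity
    y, z = ntt_ct(x[0::2]), ntt_ct(x[1::2])
    r, half, out = root_of_unity(m), m // 2, [0] * m
    for i in range(half):
        t = pow(r, i, Q) * z[i] % Q          # twiddle multiplication
        out[i] = (y[i] + t) % Q              # CT butterfly, upper output
        out[i + half] = (y[i] - t) % Q       # CT butterfly, lower output
    return out

def ntt_gs(x):
    # Gentleman-Sande: combine low and high halves, then transform each part
    m = len(x)
    if m == 1:
        return list(x)
    r, half = root_of_unity(m), m // 2
    lo = [(x[i] + x[i + half]) % Q for i in range(half)]
    hi = [(x[i] - x[i + half]) * pow(r, i, Q) % Q for i in range(half)]
    out = [0] * m
    out[0::2], out[1::2] = ntt_gs(lo), ntt_gs(hi)  # even and odd outputs
    return out
```

Both recursions reach vectors of size 1 after $\log_2 m$ levels; the iterative in-place versions perform the same butterflies layer by layer.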
In Dilithium, the NTT and NTT$^{-1}$ are computed iteratively and in-place, such that no auxiliary vectors are required to store intermediate results. For computing the NTT, Dilithium uses such an iterative Cooley-Tukey algorithm, which takes its input vector in normal order, and outputs the vector in bit-reversed order. The NTT$^{-1}$ is implemented using an iterative Gentleman-Sande algorithm, which takes its input vector in bit-reversed order and returns a vector in normal order. Note that this has no effect on the polynomial-multiplication property (using coefficient-wise multiplication), as described in Section 2.
In our implementation, similarly to previous work, we precompute and store the twiddle factors in flash. The twiddle factors are stored in Montgomery domain (with $R = 2^{32}$), such that after the multiplication in the FFT butterfly, we can use Montgomery reduction [Mon85] to reduce the product modulo $q$.
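As a rough illustration of why twiddles are kept in Montgomery domain, the sketch below models Montgomery reduction with $R = 2^{32}$ using Python integers. The constant 4236238847 is the qinv value quoted with Listing 1 and equals $-q^{-1} \bmod 2^{32}$; the function names are hypothetical.

```python
Q = 8380417
QINV_NEG = 4236238847   # -q^(-1) mod 2^32
R_BITS = 32

def montgomery_reduce(a):
    # returns a * 2^(-32) mod Q (not fully reduced); a + t*Q is divisible by 2^32
    t = (a * QINV_NEG) & 0xFFFFFFFF
    return (a + t * Q) >> R_BITS

def to_montgomery(x):
    # x * R mod Q, the form in which twiddle factors are precomputed
    return (x << R_BITS) % Q

def mul_by_twiddle(coeff, twiddle_mont):
    # one multiplication plus one reduction yields coeff * twiddle mod Q
    return montgomery_reduce(coeff * twiddle_mont)
```

Because the factor $R$ stored in the twiddle cancels the $R^{-1}$ introduced by the reduction, no separate conversion step is needed inside the butterfly.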
After each level of the NTT and NTT$^{-1}$, the polynomial coefficients are growing in size due to additions and subtractions. Intuitively, we would apply a modular reduction after each addition/subtraction operation. However, the coefficients in the input polynomial are bounded by $2q$ (which is only 24 bits) and even if we do not reduce mod $q$ after each level, we will not overflow the 32-bit registers in which we store the coefficients. Therefore, we reduce each coefficient mod $q$ only once, at the end of the NTT and NTT$^{-1}$. This technique of delaying the reduction is usually referred to as lazy reduction.
When implementing the NTT and NTT$^{-1}$, we first unroll the outer loop which iterates over the 8 levels of the NTT and NTT$^{-1}$. Furthermore, similar to the merging technique in [GOPS13], we can merge two levels of the NTT and NTT$^{-1}$ on the Cortex-M4 ({0,1}, {2,3}, {4,5} and {6,7}). Merging $k$ layers here means that instead of loading two coefficients, one loads the $2^k$ coefficients which are used together in $k$ consecutive layers. By doing so one can eliminate the load and store operations between the layers. Hence, the number of layers that can be merged is bounded by the available registers. For our implementation, we achieved the best performance by merging two layers. As a consequence, the number of store and load instructions is reduced by a factor of 2.

Listing 1 CT butterfly from [GKOS18]; q = 8380417, qinv = 4236238847

Lastly, the main difference which distinguishes our implementation from the one published in [GKOS18] is changing the polynomial coefficients to signed representation. When unsigned integers are subtracted from each other, it is possible for the result to wrap around zero (when the result would be negative). To prevent this overflow, the subtractions in the reference implementation are accompanied by an addition of a multiple of $q$, pushing the results back into the positive domain. By switching to the signed representation, the problem of negative overflows is fixed, and we do not need this extra multiple-of-$q$ addition. Therefore, switching to signed representation allows us to eliminate all these additions throughout the code. This is especially relevant for the NTT and NTT$^{-1}$ implementations, because every butterfly operation has a subtraction. Listing 2 shows our improvements to the CT butterfly in the NTT by [GKOS18] which is shown in Listing 1. For the GS butterflies in the NTT$^{-1}$, the improvements are listed in Listings 3 and 4.
However, the overflow-mitigating additions were not only present in the NTT, but also in the sampling of $s_1$, $s_2$, and $y$, polynomial subtraction, and unpacking operations throughout the scheme. By switching to signed representation, we improved the performance not only of the NTT, but also of all the other routines listed above.
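The effect of the representation on a subtraction can be modelled with 32-bit wrap-around arithmetic. This is a small sketch under the assumption that the unsigned code adds $2q$ before subtracting; the exact multiple used in a given routine may differ, and the function names are illustrative.

```python
Q = 8380417
MASK32 = 0xFFFFFFFF

def sub_u32(a, b):
    # naive unsigned subtraction: wraps around 2^32 when a < b
    return (a - b) & MASK32

def sub_u32_offset(a, b):
    # unsigned subtraction with the extra multiple-of-q addition (2q assumed)
    return (a + 2 * Q - b) & MASK32

def sub_s32(a, b):
    # signed subtraction: the 32-bit pattern is interpreted in two's complement
    d = (a - b) & MASK32
    return d - (1 << 32) if d & 0x80000000 else d
```

For $a = 1$ and $b = 2$, the naive unsigned result is no longer congruent to $a - b$ modulo $q$, whereas both the offset variant and the signed variant are; the signed variant needs one instruction fewer.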
Finally, in addition to improving the NTT and NTT$^{-1}$, we rewrote the pointwise polynomial multiplication, uniform sampling of polynomials, and polynomial reduction in assembly as these were the most expensive operations besides the already optimized NTT, NTT$^{-1}$, and hashing operations using Keccak. We omit the details, as they follow straightforwardly from the reference code.

Fast Constant-Time NTTs on Cortex-M3
Our constant-time Cortex-M3 implementation of Dilithium is based on the Cortex-M4 implementation described in the previous section. To keep this section concise, we only describe the differences here, which are mainly in order to make the implementation constant-time. When compiling the existing implementation [GKOS18] for the Cortex-M3, we identify three functions which make use of the variable-time instructions UMULL and UMLAL: NTT, NTT$^{-1}$, and pointwise multiplication (•). These functions are the only ones that involve the multiplication of the 32-bit coefficients of polynomials. When any of them operates on secret data, it will leak information through a timing side channel.
Previous work by [MGTF19] suggests that the reference implementation of Dilithium is constant time. This is, however, untrue for the Cortex-M3, because the compiler is in no way prevented from emitting any of the variable-time instructions. In their paper, the authors propose a modified Dilithium with a power-of-two modulus $q = 2^{32}$ to allow for cheaper masking. As a side effect of this proposed change, multiplications can be done using MUL, MLS, and MLA as those implicitly wrap their results modulo $2^{32}$. In that case, implementing Dilithium in constant time is more straightforward.
Interestingly, many of the operations within Dilithium do not handle secret data, and, hence, do not need to be constant time. Particularly, all operations in the signature verification (Algorithm 3) are only operating on public data and can, therefore, be implemented in variable time. Similarly, in signature generation (Algorithm 2) NTT($t_0$) (line 5), NTT($H_B(\mu, w_1)$) (line 10), and NTT$^{-1}$($\hat{c}$ • $\hat{t}_0$) (lines 16 and 17) are not processing secret data as both $t$ and $c$ are considered public. For the details we refer to the security proof in [LDK+19, Section 5]. The remaining calls to NTT, NTT$^{-1}$, and • do process secret data. Similarly, all operations in the key generation of Dilithium (Algorithm 1) have secret inputs. In our implementation we provide both a constant-time and a variable-time (leaktime) implementation of NTT, NTT$^{-1}$, and •. Because the variable-time implementations are significantly faster, we prefer using them over the constant-time implementations when we are only dealing with public data.
Note that, in theory, the compiler could introduce UMULL, UMLAL, SMULL, and SMLAL instructions in other parts of the code as well. Since there is no easy way to prevent compilers (gcc and clang) from emitting those instructions, we instead carefully analyze the assembly generated by the compiler to ensure that these instructions appear only in functions that are safe to leak. We add the suffix _leaktime to the names of variable-time functions only operating on public data to support this analysis.
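Such an analysis of the generated assembly could be partly automated. The following is a hypothetical helper, not part of the paper's artifact: it assumes a disassembly in the tab-separated layout printed by GNU objdump (`address:<TAB>encoding<TAB>mnemonic<TAB>operands`, with function labels of the form `address <name>:`) and reports every function without the _leaktime suffix that contains one of the four long-multiply instructions. The function names in the usage example are made up.

```python
import re

VARIABLE_TIME = ("umull", "umlal", "smull", "smlal")
LABEL = re.compile(r"^[0-9a-f]+ <([^>]+)>:$")

def find_leaky_functions(disassembly, safe_suffix="_leaktime"):
    """Map each non-leaktime function to the long multiplies found in it."""
    offenders = {}
    current = None
    for line in disassembly.splitlines():
        label = LABEL.match(line.strip())
        if label:
            current = label.group(1)       # entering a new function
            continue
        parts = line.split("\t")
        if current is None or len(parts) < 3 or not parts[0].strip().endswith(":"):
            continue                        # not an instruction line
        mnemonic = parts[2].strip().lower()
        if mnemonic.startswith(VARIABLE_TIME) and not current.endswith(safe_suffix):
            offenders.setdefault(current, []).append(mnemonic)
    return offenders

EXAMPLE = (
    "08000100 <poly_pointwise>:\n"
    " 8000100:\tfb81 2300 \tsmull\tr2, r3, r1, r0\n"
    " 8000104:\t4770      \tbx\tlr\n"
    "\n"
    "08000110 <poly_pointwise_leaktime>:\n"
    " 8000110:\tfbc1 2300 \tsmlal\tr2, r3, r1, r0\n"
    "\n"
    "08000120 <ntt>:\n"
    " 8000120:\tfb01 f000 \tmul.w\tr0, r1, r0\n"
)
```

A check of this kind only flags instructions by name; it does not replace a manual review of which functions actually handle secret data.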
The remainder of this section describes the necessary changes to the Cortex-M4 implementation to ensure it executes in constant time on the Cortex-M3. We describe the details from the bottom up, i.e., we start with the multiplication of coefficients, continue with the changes to the implementations of the Cooley-Tukey and Gentleman-Sande butterfly operations, and finally describe the changes to the NTT, NTT$^{-1}$, and the rest of the scheme.

SMULL and SMLAL
As Dilithium uses a 23-bit modulus $q$, its polynomials are usually represented as vectors of 32-bit values. Consequently, multiplying coefficients requires multiplication of 32-bit values producing a 64-bit product. Usually, Montgomery multiplication is used, so that the result is promptly reduced back to 32 bits. In our Cortex-M4 implementation the Montgomery multiplication is computed using SMULL and SMLAL, which, as already discussed, execute in variable time on the Cortex-M3. In case the inputs are secret, we cannot use those instructions.
In general, there are two approaches to address this issue: either re-implement SMULL and SMLAL using available constant-time instructions (MUL, MLA, ADD) or use a different representation of polynomials that does not require multiplying 32-bit coefficients. We experimented with the latter approach by using multiple smaller 16-bit polynomial multiplications to construct a larger 23-bit polynomial multiplication. The idea is to perform polynomial multiplications in $R_q$ by first splitting up the polynomial into multiple polynomials in $\mathbb{Z}_{q_i}[X]/(X^n + 1)$, performing the polynomial multiplication in these smaller rings and then reconstructing the result in $R_q$ using the explicit Chinese remainder theorem [BS07]. A similar approach is used in the AVX2 implementation of NTRUPrime [BCLv19]. For the result to be correct, it needs to hold that $2n \cdot \lfloor q/2 \rfloor^2 < \prod_i q_i$. For example, one could use the NTT-friendly primes {7681, 10753, 11777, 12289}. However, this approach turned out to be slower than re-implementing the SMULL and SMLAL instructions using MUL instructions, and hence we did not use it in our implementation. Nonetheless, we present results for 16-bit NTTs on the Cortex-M3 for the primes 3329 and 12289 which are used in the NIST key-encapsulation candidates Kyber [ABD+19] and NewHope [PAA+19], respectively. We report the results for the full schemes in Appendix A.
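The size condition for the multi-modulus approach can be checked directly for the four example primes. The sketch below assumes the condition $2n \cdot \lfloor q/2 \rfloor^2 < \prod_i q_i$ with centered coefficients, and it uses a plain Chinese-remainder reconstruction of a single integer rather than the explicit CRT of [BS07]; the function name is illustrative.

```python
from math import prod

Q = 8380417
N = 256
PRIMES = [7681, 10753, 11777, 12289]   # example primes from the text
M = prod(PRIMES)
# largest absolute value of a product coefficient for centered inputs
BOUND = N * (Q // 2) ** 2

def crt_reconstruct(residues):
    # plain CRT, followed by centering into (-M/2, M/2]
    x = 0
    for r, p in zip(residues, PRIMES):
        m = M // p
        x += r * m * pow(m, -1, p)
    x %= M
    return x - M if x > M // 2 else x
```

Since every coefficient of the integer product lies in the interval $[-\text{BOUND}, \text{BOUND}]$ and $2 \cdot \text{BOUND} < M$, the centered CRT result is the exact integer coefficient, which can then be reduced modulo $q$.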
To re-implement SMULL and SMLAL, we use the schoolbook approach, i.e., we represent the 32-bit inputs in radix $2^{16}$ and compute the product as sums of 32-bit products. Let $a = 2^{16} a_1 + a_0$ and $b = 2^{16} b_1 + b_0$, with $0 \le a_0, b_0 < 2^{16}$ and $-2^{15} \le a_1, b_1 < 2^{15}$, then the product $ab = 2^{32} a_1 b_1 + 2^{16} (a_0 b_1 + a_1 b_0) + a_0 b_0$, with $-2^{31} \le a_i b_j < 2^{31}$. Accordingly, our constant-time assembly implementations for SMULL and SMLAL are illustrated in Listing 5 and Listing 6. We denote them by SBSMULL and SBSMLAL in the following. The four 16-bit halves of the two multiplicands are passed in the registers $a_0$, $a_1$, $b_0$, and $b_1$; the 64-bit output is placed in $c_0$ (lower half) and $c_1$ (upper half). For SMLAL, $c_0$ and $c_1$ initially contain the value to be added to the product. On the Cortex-M3, additions and multiplications use 1 cycle, while MLA uses 2 cycles. As such, the SBSMULL macro takes 7 cycles to execute, while SBSMLAL takes 9 cycles.
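The schoolbook decomposition can be modelled with explicit 32-bit wrap-around to see where it agrees with a full 64-bit multiplication. This is an arithmetic model only, with illustrative names; it is not a transcription of Listing 5 and says nothing about instruction scheduling.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def wrap_s32(x):
    # interpret the low 32 bits of x as a signed two's-complement value
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def wrap_s64(x):
    x &= MASK64
    return x - (1 << 64) if x & (1 << 63) else x

def split(a):
    # a = 2^16 * a1 + a0 with 0 <= a0 < 2^16 and a1 signed
    return a & 0xFFFF, a >> 16

def sbsmull(a, b):
    a0, a1 = split(a)
    b0, b1 = split(b)
    lo = a0 * b0                          # unsigned 32-bit product
    mid = wrap_s32(a0 * b1 + a1 * b0)     # MUL + MLA: the sum wraps at 32 bits
    hi = a1 * b1                          # signed 32-bit product
    return wrap_s64((hi << 32) + (mid << 16) + lo)
```

For a coefficient multiplied by a twiddle-sized value the model matches the exact product, while for two large operands such as `0x7FFFFFFF` the middle sum wraps and the result differs.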
It is important to note that SBSMULL (SBSMLAL) is not semantically equivalent to SMULL (SMLAL). In case the accumulation $(a_0 b_1 + a_1 b_0)$ in line 7 of Listing 5 or line 11 of Listing 6 overflows, the carry bit is lost and the result will not be correct. Hence, our schoolbook multiplication does not support the full 32-bit range of the inputs. In general, we have to distinguish two cases:
1. One of the factors (say $b$) is small, e.g., a twiddle factor ($< q$) or the constant $q$. In that case $b_1$ is at most $\lfloor q/2^{16} \rfloor = 127$. In the worst case both $b_0$ and $a_0$ are equal to $2^{16} - 1$. Consequently, for the addition $(a_0 b_1 + a_1 b_0)$ not to overflow, $a_1$ can be at most $\lfloor (2^{31} - 1 - 127 \cdot (2^{16} - 1)) / (2^{16} - 1) \rfloor$.
2. Both multiplicands can be equally large. This occurs, for example, in the pointwise polynomial multiplication. In that case both $a_0 b_1$ and $a_1 b_0$ need to be less than or equal to $(2^{31} - 1)/2$.
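The bound in the first case can be worked out numerically. The following sketch assumes non-negative half-words and the worst case $a_0 = b_0 = 2^{16} - 1$ stated above; the variable names are illustrative.

```python
Q = 8380417
INT32_MAX = 2**31 - 1
HALF_MAX = 2**16 - 1          # worst case for the unsigned lower halves a0, b0

# Case 1: b is a twiddle factor or q itself, so its upper half is small.
B1_MAX = Q >> 16
A1_MAX_CASE1 = (INT32_MAX - B1_MAX * HALF_MAX) // HALF_MAX

# Case 2: both terms of the middle sum may be equally large.
TERM_MAX_CASE2 = INT32_MAX // 2
A1_MAX_CASE2 = TERM_MAX_CASE2 // HALF_MAX

def middle_sum_fits(a1, b1):
    # does a0*b1 + a1*b0 stay within a signed 32-bit register in the worst case?
    return HALF_MAX * b1 + a1 * HALF_MAX <= INT32_MAX
```

In the first case the upper half-word of the large operand may be as large as 32641, which leaves ample room above $q$ for lazily reduced coefficients.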

Cooley-Tukey and Gentleman-Sande Butterflies
Using constant-time SBSMULL and SBSMLAL sub-routines, we can construct the butterfly operations needed to implement the NTT and NTT$^{-1}$. Listing 7 depicts the modified Cooley-Tukey butterfly operation based on Listing 2. To be able to use SBSMULL, $p_1$ and the twiddle factor need to be loaded in half-words, while $p_0$ can be loaded as a 32-bit word. For the multiplication by $q$, we require the lower and the upper half-word of $q$ separately. Additionally, we need to split up the 32-bit result of the multiplication by $-q^{-1}$ into half-words (lines 8 and 9). In total, the Cooley-Tukey butterfly operation requires 21 cycles on the Cortex-M3, while Listing 2 only needs 5 cycles on the Cortex-M4.
Similarly, Listing 8 depicts our constant-time assembly implementation of the Gentleman-Sande butterfly. As the addition and subtraction happen before the multiplication by the twiddle factor, both $p_0$ and $p_1$ are loaded as full 32-bit words, while the twiddle factor is again split into two half-words. After the subtraction in line 5, we split up the result before we pass it into SBSMULL. To perform the Montgomery reduction, we again need to split up the result of the multiplication by $-q^{-1}$ into halves, before multiplying it by $q$ using SBSMLAL. Each Gentleman-Sande butterfly operation requires 23 cycles on the Cortex-M3, which compares to 5 cycles for Listing 4 on the Cortex-M4.

NTT, NTT$^{-1}$, and •
Using the Cooley-Tukey butterfly from the previous section, we implement the NTT. As in the Cortex-M4 implementation, we pre-compute all the twiddle factors and place them into flash. As our Cooley-Tukey butterfly requires the second coefficient and the twiddle factor in halves, we load those using ldrh (for the unsigned lower half-word) and ldrsh (for the signed upper half-word). This, however, significantly increases register pressure and hinders the common optimization technique of merging multiple levels of butterfly operations with the purpose of saving store and load instructions. Therefore, we cannot use that optimization and need to perform one layer at a time. This also leads to a slightly different ordering of the twiddle factors in memory. The result of the butterfly is returned as a 32-bit value and can, hence, be stored back using str.
For the NTT$^{-1}$, we proceed likewise. However, the inputs to the butterfly have to be loaded in full words using ldr. At the end of the NTT$^{-1}$, each coefficient of the polynomial is multiplied by the constant $n^{-1}$ followed by a Montgomery reduction. We integrate this step into the last level of the NTT$^{-1}$ in order to minimize load and store operations. Furthermore, we observe that $n^{-1}$ in Montgomery domain is 41 978 and, hence, fits in 16 bits. Therefore, we do not need a full SBSMULL, but can use a simpler multiplication routine that multiplies a 32-bit word by the 16-bit constant, which requires only 2 multiplication instructions and, hence, takes 2 cycles less.
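The constant and the cheaper multiplication can be illustrated as follows. The sketch relies on the observation that 41 978 is congruent to $n^{-1} \cdot 2^{64}$ modulo $q$ (that is, $n^{-1}$ with the Montgomery factors for $R = 2^{32}$ folded in); the helper name `mul_32x16` is hypothetical and only models the arithmetic of a 32-bit by 16-bit product built from two 32-bit multiplications.

```python
Q = 8380417
N = 256
F = 41978   # constant used in the last NTT^-1 level (from the text)

def mul_32x16(a, c):
    # a: signed 32-bit value, c: unsigned constant below 2^16
    a0, a1 = a & 0xFFFF, a >> 16   # unsigned lower half, signed upper half
    lo = a0 * c                    # first MUL, result below 2^32
    hi = a1 * c                    # second MUL, fits a signed 32-bit register
    return (hi << 16) + lo         # exact product, at most 48 bits wide
```

Compared with the four partial products of a full schoolbook multiplication, the two products involving the (zero) upper half of the constant disappear.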
Besides the NTT and NTT$^{-1}$, we identify one other place where our compiler is introducing SMULL and SMLAL instructions: the pointwise multiplication •. If either of the multiplicands is secret, the pointwise multiplication must not use the variable-time instructions. We guarantee that by rewriting the pointwise multiplication in assembly and making use of the Montgomery multiplication using SBSMULL and SBSMLAL like in our butterfly operations in Listing 7 and Listing 8. In case both inputs are considered public, we simply use the pointwise multiplication which was presented in Section 3.

Time-Memory Trade-Offs
Depending on the programmer's requirements, there are multiple ways in which we can implement Dilithium signing, each with their own tradeoffs.
For microcontroller implementations of Dilithium the main challenge is that computing $A$ is expensive since it involves many calls to SHAKE256, which is relatively slow in software. Also, $A$ is used multiple times during the signing procedure. Consequently, we either have to store the complete matrix $A$ in RAM or flash, or incur the cost of having to recompute it during each loop iteration.
In order to explore this time-memory tradeoff, we implement the signing operation using three different strategies. In the first strategy, we refuse to recompute $A$ during the signing operation and instead store it in flash. The second strategy describes the more traditional implementation of Dilithium, expanding $A$ once during each signing operation before entering the rejection-sampling loop. The third case describes the situation wherein we are highly constrained in flash and SRAM size, but have ample performance budget. In this strategy, we minimize the amount of memory needed by computing both $A$ and $y$ on the fly.
Although the algorithm's intermediate values can be stored anywhere in the RAM (i.e., in SRAM/CCM for the STM32F407 and SRAM1/SRAM2 for the ATSAM3X8E), we see no real benefit in doing that. Therefore, to keep it simple, we will store the variables on the stack.

Strategy 1: A in Flash
In Dilithium signing, the values A, ŝ1, ŝ2, and t0 depend only on the Dilithium key pair. Therefore, instead of computing these values during signing, we can compute them as part of the key generation. We assume that the platform has some kind of non-volatile storage that is large enough (and secure enough, see footnote 3) to store these extra values. Then, during the signature generation algorithm, instead of passing in sk (as described in line 1 of Algorithm 2), we pass a larger struct that also contains the precomputed values. These precomputed values (A, ŝ1, ŝ2, and t0) add up to k • l + 2k + l additional polynomials that have to be stored. In the case of Dilithium3, this amounts to 34 KiB of extra flash space, as each Dilithium polynomial requires 1 KiB when stored uncompressed.
Because these four values are now stored separately, we no longer have to compute (and store) them during the signature generation. Thus, this strategy saves a considerable amount of SRAM in exchange for (relatively cheap) flash space. Furthermore, in the absence of hardware-accelerated SHAKE256, generating A is a relatively expensive step in the signature-generation process. Having A stored in flash speeds up the overall performance of generating signatures. Hence, we think that this strategy will be the most favored for deployment in a real-world small-device environment.
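The flash overhead of strategy 1 can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the round-2 parameter sets (k, l) = (4, 3), (5, 4), and (6, 5) for Dilithium2, Dilithium3, and Dilithium4; these values are taken from the Dilithium specification rather than from this section, and the function names are purely illustrative.

```python
# Assumed round-2 Dilithium parameter sets (k, l); not stated in this section.
PARAMS = {"Dilithium2": (4, 3), "Dilithium3": (5, 4), "Dilithium4": (6, 5)}

POLY_BYTES = 1024  # one uncompressed polynomial occupies 1 KiB


def precomputed_polys(k, l):
    """Number of polynomials in A (k*l), s1_hat (l), s2_hat (k), and t0 (k)."""
    return k * l + 2 * k + l


def extra_flash_kib(k, l):
    """Extra flash (in KiB) needed to hold the precomputed polynomials."""
    return precomputed_polys(k, l) * POLY_BYTES // 1024


for name, (k, l) in PARAMS.items():
    print(name, precomputed_polys(k, l), "polynomials,",
          extra_flash_kib(k, l), "KiB")
```

For Dilithium3 this reproduces the 34 polynomials (34 KiB) stated above.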

Strategy 2: A in SRAM
When there is enough SRAM available on the device, we opt for the "traditional" implementation of the signature generation algorithm. That is, we follow the specification closely, and implement signature generation following the general structure of Algorithm 2. Apart from some space for storing intermediate values, we will need to allocate
• 4k polynomial slots for storing t0, ŝ2, w, w1;
• (k + 3)l polynomial slots for storing A, ŝ1, y, and ŷ; and
• 1 polynomial slot for storing ĉ.
This adds up to a pretty high lower bound of k • l + 4k + 3l + 1 KiB of necessary stack space, e.g., 53 KiB for Dilithium3.
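The lower bound for strategy 2 follows directly from summing the three groups of polynomial slots. As a small sanity check, the sketch below tallies them per group; the function names are hypothetical, and (k, l) = (5, 4) for Dilithium3 is an assumption taken from the round-2 specification.

```python
def strategy2_slots(k, l):
    """Polynomial slots per group, as itemised for strategy 2."""
    return {
        "t0, s2_hat, w, w1": 4 * k,
        "A, s1_hat, y, y_hat": (k + 3) * l,
        "c_hat": 1,
    }


def strategy2_stack_kib(k, l):
    """Lower bound in KiB, at 1 KiB per uncompressed polynomial slot."""
    return sum(strategy2_slots(k, l).values())


k, l = 5, 4  # assumed Dilithium3 parameters
for group, count in strategy2_slots(k, l).items():
    print(f"{group}: {count} slots")
print("total:", strategy2_stack_kib(k, l), "KiB")
```

The total agrees with the closed form k • l + 4k + 3l + 1, which evaluates to 53 for these parameters.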
Footnote 3: A and t0 need to be integrity-protected; ŝ1 and ŝ2 need to remain secret and integrity-protected.

Strategy 3: Streaming A and y
For the last strategy, we consider the situation in which we optimize stack usage without using extra long-term storage for precomputed values. In the signing implementation, we optimize exclusively for stack usage; our only aim is to find the lower bound of the needed stack space.
In contrast to the other strategies, we do not store any complete copies of A and y. Instead, we regenerate every element of A and y on the fly when we compute elements of w (in line 8 of Algorithm 2). Because we do not retain y after this step, we regenerate it again in line 11 of Algorithm 2. Relative to strategy 2, this saves us k • l polynomials of space for A, and another l polynomials for y.
When we look further into stack-optimizing the signing algorithm, we find that the main bottleneck in terms of stack usage is the overlapping lifetimes of w and ĉ. In lines 12 and 16 of Algorithm 2, the values r1, r0, and h all depend on both w and ĉ. However, in line 10 we also need the complete value of w1 (and thus w) to compute ĉ. Therefore, we conclude that we either have to store w and ĉ both at the same time, or we have to recompute every element of w on the fly when we are computing r1 and r0 in line 12, and when we are constructing the hint h in line 16.
In order to recompute elements of w, we would have to do the matrix multiplication NTT⁻¹(Â • NTT(y)) all over again, including the complete regeneration of A and y. The performance cost of this optimization would be at least a factor of 2, so we chose not to do this. Instead, we accept that w and ĉ both need to be stored at the same time.
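To illustrate the streaming idea, the following sketch computes the matrix-vector product once with A and y fully materialised (as in strategy 2) and once with only a single entry of A and a single entry of y alive at any time (as in strategy 3). It is a simplified model under several assumptions: `toy_expand` is a hypothetical stand-in for ExpandA and ExpandMask (the real functions use rejection sampling and different output ranges), y is treated as if it were already in NTT domain, and the final inverse NTT is omitted.

```python
import hashlib

Q = 8380417  # Dilithium modulus
N = 256      # coefficients per polynomial


def toy_expand(seed: bytes, *index: int) -> list:
    """Toy stand-in for ExpandA / ExpandMask: one polynomial from a seed."""
    data = hashlib.shake_256(seed + bytes(index)).digest(4 * N)
    return [int.from_bytes(data[4 * i:4 * i + 4], "little") % Q
            for i in range(N)]


def w_stored(rho, y_seed, k, l):
    """Strategy-2 style: hold all of A and y (k*l + l polynomials)."""
    A = [[toy_expand(rho, i, j) for j in range(l)] for i in range(k)]
    y = [toy_expand(y_seed, j) for j in range(l)]
    return [[sum(A[i][j][c] * y[j][c] for j in range(l)) % Q
             for c in range(N)] for i in range(k)]


def w_streamed(rho, y_seed, k, l):
    """Strategy-3 style: regenerate each entry of A and y when needed."""
    w = []
    for i in range(k):
        acc = [0] * N                      # accumulator for row i of w
        for j in range(l):
            a_ij = toy_expand(rho, i, j)   # regenerated, never kept
            y_j = toy_expand(y_seed, j)    # regenerated once per row
            for c in range(N):
                acc[c] = (acc[c] + a_ij[c] * y_j[c]) % Q
        w.append(acc)
    return w


same = w_stored(b"rho", b"y-seed", 2, 2) == w_streamed(b"rho", b"y-seed", 2, 2)
print("streamed result matches stored result:", same)
```

In this model every y entry is expanded k times instead of once, which mirrors the recomputation cost that strategy 3 accepts in exchange for not holding k • l + l polynomials.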

Splitting signature generation in an offline and online phase
To speed up the Dilithium signing process even more, one can choose to split the signature generation into an offline and an online phase, where the offline phase can be performed before the message to be signed is known. The general idea of using an offline/online phase was introduced in 1989 by Even, Goldreich, and Micali [EGM90], and was first proposed for use in lattice-based signature schemes in [AYS15]. It was also used by Ravi, Gupta, Chattopadhyay, and Bhasin in [RGCB19, Section 4.1.2] to optimize the online latency of Dilithium signing.
For Dilithium, however, this optimization comes at a significant cost. In their paper, Ravi, Gupta, Chattopadhyay, and Bhasin describe that an additional 260 KiB of space (footnote 4) is needed to store the precomputed values for Dilithium3, such that there is a 95% probability that at least one of the y values results in a good signature. For our main target (the ATSAM3X8E), that would mean that more than half of its flash space would already be lost to storing these precomputed values. We think that, in the general case, the improved signature-generation latency does not justify this kind of loss in available flash space.

Results
This section presents the performance results for our Dilithium implementations. First, we present new speed records for the Dilithium NTT on the Cortex-M4 and first results for the Dilithium, Kyber, and NewHope NTT on the Cortex-M3 in Section 6.1. We then present results for the full Dilithium scheme on the Cortex-M4 (Section 6.2) and on the Cortex-M3 (Section 6.3). Finally, we profile our implementations on the Cortex-M4 in Section 6.5.
Cortex-M4 setup. We benchmark all our Cortex-M4 implementations on an STM32F407 discovery board, which features the STM32F407VG microcontroller. It was clocked at 24 MHz to eliminate flash wait states when fetching instructions or data from flash. For benchmarking the algorithm latency, we used the SysTick counter. Our build and benchmarking setup is based on pqm4 [KRSS], and benchmarking our code within pqm4 gives the same performance results. We will open a pull request to merge our code into pqm4.
Cortex-M3 setup. The Cortex-M3 measurements were done on an Arduino Due board, which uses the ATSAM3X8E microcontroller. The ATSAM chip was clocked at 16 MHz, which results in a flash access time with zero wait states. The algorithm latencies were measured using the internal cycle counter (CYCCNT).
Compiler, random numbers, stack measurements, and Keccak. On both platforms, we used the GCC compiler, version 10.2.0. For obtaining random numbers (e.g., ρ and K), we use the hardware random number generators which are available on both cores. The stack usage was measured by filling the memory with sentinel values, executing the algorithm, and measuring the number of sentinel-value bytes that were overwritten during the execution. In the stack measurements, space reserved for input and output values is not counted. For SHA3 and SHAKE, we use the assembly-optimized implementation of the Keccak permutation from the eXtended Keccak Code Package (XKCP) (footnote 5). As it only uses ARMv7-M instructions, we use the same implementation on both platforms.
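The sentinel-based stack measurement can be modelled in a few lines. The sketch below simulates it on a byte array rather than on real hardware; the sentinel value 0xA5, the downward-growing stack layout, and the `fake_workload` function standing in for the measured routine are all assumptions made for illustration.

```python
SENTINEL = 0xA5  # assumed fill pattern; any fixed byte value works


def paint(stack: bytearray) -> None:
    """Fill the whole stack region with the sentinel value."""
    stack[:] = bytes([SENTINEL]) * len(stack)


def measure(stack: bytearray) -> int:
    """Return the high-water mark in bytes.

    The stack is modelled as growing downwards from the end of the array,
    so the lowest overwritten address marks the deepest point reached.
    """
    for addr, value in enumerate(stack):
        if value != SENTINEL:
            return len(stack) - addr
    return 0


def fake_workload(stack: bytearray, depth: int) -> None:
    """Hypothetical stand-in for the measured routine: touches `depth` bytes."""
    top = len(stack)
    for addr in range(top - depth, top):
        stack[addr] = 0x00


stack = bytearray(4096)
paint(stack)
fake_workload(stack, 1500)
print("measured stack usage:", measure(stack), "bytes")
```

One limitation of this technique is that a byte which the routine happens to overwrite with the sentinel value itself goes undetected, so the measurement can slightly underestimate the true usage.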
Side-channel protection. Our implementations consider only timing side channels, i.e., we provide constant-time code that avoids leaking secret data through variable-time instructions, secret-dependent branches, and secret-dependent memory addresses. For certain use cases, one may also want to protect against more powerful attacks such as power analysis, e.g., by using masking. There exists work in the literature on masking Dilithium by Migliore, Gérard, Tibouchi, and Fouque [MGTF19], which presents a protected implementation of a modified Dilithium. More work is required to implement a fully masked Dilithium that adheres to the specification submitted to NIST. However, this is outside the scope of this paper, and we leave it for future work.

NTT performance
In Table 2, we list the benchmarking results for the optimized NTT, NTT⁻¹, and pointwise multiplication (•) implementations in Dilithium, Kyber, and NewHope1024 on the Cortex-M4 and Cortex-M3. For the Cortex-M4, we obtain a speedup of 20% for the NTT and NTT⁻¹ compared to [GKOS18, RGCB19]. This speedup is mainly due to the switch to a signed representation of polynomials. We use this representation throughout our new Dilithium implementations, which saves a number of additions of multiples of q. Additionally, we optimize the pointwise multiplication (•), which was not optimized in previous implementations.
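Why a signed representation saves additions of multiples of q can be seen on a single NTT butterfly. The sketch below is an illustration only, not the assembly used in our implementations: the `%` operator stands in for the actual modular reduction, and the function names and the example inputs are hypothetical.

```python
Q = 8380417  # Dilithium modulus


def reduce_unsigned(x):
    """Stand-in for a reduction with output in [0, Q)."""
    return x % Q


def reduce_signed(x):
    """Stand-in for a reduction with output centred around zero."""
    r = x % Q
    return r - Q if r > Q // 2 else r


def butterfly_unsigned(a, b, zeta):
    """Butterfly on non-negative coefficients."""
    t = reduce_unsigned(zeta * b)
    # a - t could be negative, which an unsigned word cannot hold,
    # so a multiple of Q is added before subtracting.
    return a + t, a + Q - t


def butterfly_signed(a, b, zeta):
    """Butterfly on signed coefficients: no correction term is needed."""
    t = reduce_signed(zeta * b)
    return a + t, a - t


u = butterfly_unsigned(5, 7, 1753)
s = butterfly_signed(5, 7, 1753)
print("unsigned:", u)
print("signed:  ", s)
print("congruent mod Q:", all((x - y) % Q == 0 for x, y in zip(u, s)))
```

Both variants produce outputs that agree modulo q; the signed one simply omits the extra addition in every butterfly.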

Cortex-M4 performance
Table 3 lists the benchmarking results of our Dilithium implementation, together with the cycle counts from the relevant related work. As signing time varies considerably depending on the number of rejections, we performed 10 000 executions and took the average of the resulting cycle counts.
For our signing strategy 1, we need to pre-compute A, ŝ1, ŝ2, and t0. We include this pre-computation in the key generation. Compared to the [GKOS18] implementation, which is comparable to our signing strategy, we obtain speedups of 13%, 27%, and 18% for key generation, signing, and verification, respectively. We also drastically decrease the stack consumption.
When comparing to the [RGCB19] implementation, our strategy 1 is similar to their scenario 2, while our strategy 2 corresponds to their scenario 1. For both scenarios, we achieve substantial speedups for all parameter sets ranging from 14% to 20%.
Our strategy 3 implementation, which is optimized solely for memory footprint, achieves by far the worst performance in terms of speed.

Cortex-M3 performance
Table 4 presents our results for the Cortex-M3. The only other work implementing (a modified version of) Dilithium on the Cortex-M3 is from Migliore, Gérard, Tibouchi, and Fouque [MGTF19]. However, they do not report cycle counts on the Cortex-M3, and we were not able to find their source code online. Therefore, we unfortunately cannot compare our results to theirs.

Stack usage
Up to this point, we have mainly discussed the improvements in Dilithium's speed. However, as already mentioned, it is also important to be economical in the usage of stack space.
In Tables 3 and 4, we show the considerable improvement in stack-space usage over the previous works. We see that signature verification needs only around 10 KiB of storage space (depending on the Dilithium parameters), without incurring a performance hit. Furthermore, when Dilithium is deployed on a device that has enough space to store A (either in SRAM or in flash), we get a reasonable signature-generation latency.

a "Strategy 1" from Section 5.1 corresponds to "Scenario 2" in [RGCB19].
b "Strategy 2" from Section 5.2 corresponds to "Scenario 1" in [RGCB19].
c For Dilithium2 using stack strategy 1, an additional 23 632 bytes of flash space are used for storing the precomputed values.
d For Dilithium3 using stack strategy 1, an additional 34 896 bytes of flash space are used for storing the precomputed values.
e For Dilithium4 using stack strategy 1, an additional 48 208 bytes of flash space are used for storing the precomputed values.

Table 1: Dilithium parameter sets

Table 1 also gives the expected number of iterations of the rejection sampling. Due to this iterative nature, the runtime of Dilithium varies significantly between multiple signature generations. Note, however, that the rejection probability does not depend on the secret key, and consequently, the variable run-time caused by rejection sampling does not violate the constant-time property of implementations of Dilithium [LDK+19, Section 3.3].
Notation. We follow the notation of the Dilithium specification [LDK+19] and denote polynomials by lower-case Latin letters like c, vectors of polynomials by bold lower-case letters like t, and matrices by bold upper-case letters (A). Polynomials, vectors, and matrices that have been transformed to NTT domain are identified by their hat, e.g., ĉ, â, and Â. The operator • describes coefficient-wise multiplication. The operator || denotes concatenation of two inputs that are implicitly converted to a byte string. ||a||∞ refers to the maximum absolute coefficient of the polynomial a and is similarly defined for vectors. When sampling a from a certain distribution S, we write a ← S. S_η is the uniform distribution ranging from −η to +η (both inclusive).
Functions. As a central building block, Dilithium uses the NTT and NTT⁻¹ functions, which are used to implement efficient polynomial multiplication of a, b as NTT⁻¹(NTT(a) • NTT(b)). The details of the Dilithium NTT are described later in this section. In addition, Dilithium uses a collision-resistant hash function H with 384-bit output length and a cryptographic hash function H_B outputting a polynomial that has exactly 60 coefficients set to ±1 while the remaining 196 coefficients are zero. The hash functions H and H_B are implemented using the extendable-output function (XOF) SHAKE256. Furthermore, Dilithium defines the seed expansion functions ExpandA and ExpandMask; the rounding functions Power2Round, HighBits, and Decompose; and the hint functions MakeHint and UseHint. To keep the algorithm description brief, we omit the details of those functions and refer the reader to the Dilithium specification.
Scheme Specification. Algorithm 1, Algorithm 2, and Algorithm 3 specify Dilithium key generation, signature generation, and signature verification. The descriptions are consistent with the ones from Figure 4 in the Dilithium specification [LDK+19], but we omit details about rounding that are not relevant to this work.
Number Theoretic Transform. At the core of the Dilithium scheme construction and parameter choices is the number theoretic transform (NTT), which allows efficient polynomial multiplication. The NTT can be seen as the counterpart of the Fourier transform in a finite field. NTT-based multiplication allows the multiplication of two polynomials a and b in quasi-linear time by first transforming both arguments to NTT domain (or frequency

Table 2: Performance results for NTT, NTT⁻¹, and • of Dilithium, Kyber, and NewHope for the Cortex-M3 and the Cortex-M4, reported in clock cycles.

Table 5: Profiling results on the Cortex-M4

to provide a large speedup. When looking at individual functions of the non-optimized code, no function takes more than 3% of the total run-time.