Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1

We present new speed records on the Armv8-A architecture for the lattice-based schemes Dilithium, Kyber, and Saber. The core novelty in this paper is the combination of Montgomery multiplication and Barrett reduction resulting in "Barrett multiplication", which allows particularly efficient modular one-known-factor multiplication using the Armv8-A Neon vector instructions. These novel techniques, combined with fast two-unknown-factor Montgomery multiplication, Barrett reduction sequences, and interleaved multi-stage butterflies, result in significantly faster code. We also introduce "asymmetric multiplication", an improved technique for caching the results of the incomplete NTT, used e.g. for matrix-to-vector polynomial multiplication. Our implementations target the Arm Cortex-A72 CPU, on which our speed is 1.7× that of the state-of-the-art matrix-to-vector polynomial multiplication in kyber768 [Nguyen–Gaj 2021]. For Saber, NTTs are far superior to Toom–Cook multiplication on the Armv8-A architecture, outrunning the state-of-the-art matrix-to-vector polynomial multiplication by 2.0×. On the Apple M1, our matrix-vector products run 2.1× and 1.9× faster for Kyber and Saber respectively.


Introduction
When large quantum computers arrive, Shor's algorithm [Sho97] will break almost all currently deployed public-key cryptography by solving the integer factorization and discrete logarithm problems. Preparing for this, the U.S. National Institute of Standards and Technology (NIST) has initiated a process to select new cryptosystems that withstand the increased capabilities of quantum computers, an area known as Post-Quantum Cryptography (PQC). This process naturally divides into the categories of digital signatures and key encapsulation mechanisms (KEMs) [NIS] and is currently in the third round, where 7 finalists and 8 alternate candidates still compete [AASA+20].
Undoubtedly, the scheme(s) selected at the end of the NIST PQC competition will become prominent computational workloads of the future. It is therefore important to understand their performance and resource characteristics on the hugely diverse spectrum of today's and upcoming computing platforms, ranging from low-power IoT devices through desktops to high-end data center cores. Representative of the former, the focus of performance analysis of PQC on embedded devices has so far been the Arm® Cortex®-M4 CPU. Representative of the latter, the focus of performance analysis on high-end cores has so far been the AVX2-capable Intel® CPUs Haswell and Skylake.
In this article, we contribute to the performance analysis of prominent PQC candidates on CPUs implementing the application profile (A-profile) of the Arm architecture, an area which, despite its importance, has been comparatively little studied so far. CPUs implementing the A-profile are ubiquitous, and form a wide spectrum in themselves: it contains power-efficient CPUs like the Cortex-A7 processor for Linux-capable embedded IoT devices, cores for the mobile and desktop market like the Cortex-A78 processor, as well as the Arm® Neoverse™ IP for infrastructure applications, for example, the Arm Neoverse-based AWS Graviton processors. The Fugaku supercomputer, which ranked as the fastest supercomputer in the world in 2020 and 2021 [TOP20, TOP21], is also based on a core implementing the A-profile. Considering the breadth of use and availability of A-profile cores in the computing ecosystem, it is therefore important to include them in the performance evaluation for PQC.
Another axis of distinction within the A-profile is the specific version of the architecture, such as Armv7-A, Armv8-A, and, as of late, Armv9-A, and their respective sets of extensions. In this article, we focus on implementations of PQC on CPUs based on the 64-bit Armv8-A architecture, leveraging the availability of the Armv8-A version of the Arm® Neon™ Single Instruction Multiple Data (SIMD) instruction set. We do not study implementations based on the Neon instruction set for Armv7-A or implementations based on the Scalable Vector Extensions SVE and SVE2 here; this is left for future work.
Returning to the nature of the PQC workloads themselves: Many of the remaining NISTPQC candidate schemes are based on so-called structured lattices, for which the central arithmetic operation is modular polynomial multiplication. A central implementation technique for such polynomial multiplication is the Number-Theoretic Transform (NTT), an integer-analog of the Fast Fourier Transform.
In this work, we explore the use of NTTs in implementing the NISTPQC structured lattice finalist candidates. It has always been a point of interest to determine the realms of applicability for various advanced multiplication techniques, and in addition to finding out how well NTTs do, we also compare them to other approaches towards polynomial multiplication, in particular Toom-Cook/Karatsuba.

Contributions.
We exhibit NTT-based implementations of NISTPQC cyclotomic-ring lattice candidates for CPUs implementing the A-profile of the Arm architecture, leveraging the Neon vector extension. We optimized mainly for the common Cortex-A72 CPU (used e.g. in the Raspberry Pi 4), and somewhat for the Apple M1, a high-end desktop core.
We improve on old and discover new implementation techniques, including a Barrett-reduction-based one-known-factor multiplication, which we show to be roughly equivalent to the Montgomery multiplication technique of [Sei18], but particularly suitable for Neon.
We also introduce the trick of "asymmetric base multiplication" which is applicable whenever we are caching incomplete NTT results (i.e., Kyber/Saber). Furthermore, we improve on the best-known two-unknown-factors multiplications and Barrett reduction sequences in the literature.
Related Work. The most recent works on lattice-based cryptography on the Armv8-A architecture using the Neon vector extension are by Nguyen and Gaj [NG21] and Sanal et al. [SKS+21]. NTTs are frequently used in the context of polynomial multiplication.

Structure of the paper. This paper is structured as follows: Section 2 introduces the schemes we implement, the Armv8-A microarchitecture, and the mathematical background on NTTs. In Section 3, we present more mathematics in the form of reductions for the implementations used in this paper. In Section 4, we go through the implementation details of the different schemes. In Section 5, we show performance numbers and conclude.

Kyber

In CPA-secure key generation and encryption, the rate-determining operations are the (k × k) × (k × 1) matrix-to-vector polynomial multiplications A^T·s and A·s (MatrixVectorMul). In decryption, it is the k × 1 inner product of polynomials b^T·s (InnerProd). Note that [ABD+20b] specifies that we do all multiplications via incomplete NTT, and NTT results are in bit-reversed order. The public matrix A is sampled in (incomplete) NTT domain by expanding a seed using the extendable-output function (XOF) SHAKE128. There is 1 matrix-to-vector polynomial multiplication and 0, 1, and 2 inner products of polynomials in each of key generation, encapsulation, and decapsulation respectively.

Saber
Saber [DKRV20] is a NISTPQC finalist candidate lattice-based key encapsulation mechanism based on the Module Learning With Rounding (M-LWR) problem. The module is of dimension ℓ × ℓ over the ring R_q = Z_q[x]/(x^n + 1), with q = 2^13 and n = 256. Similar to Kyber, the Saber KEM is built on top of a CPA-secure PKE via a CCA transform à la [HHK17]. For algorithmic descriptions of the Saber PKE see [DKRV20].
In CPA-secure key generation and encryption, the rate-determining operations are the (ℓ × ℓ) × (ℓ × 1) matrix-to-vector polynomial multiplications A^T·s and A·s (MatrixVectorMul). In decryption, it is the ℓ × 1 inner product of polynomials b^T·s (InnerProd). There is 1 MatrixVectorMul in key generation; 1 MatrixVectorMul + 1 InnerProd in encapsulation; and 1 MatrixVectorMul + 2 InnerProd in decapsulation, as decapsulation needs a full re-encryption.
Note that Saber's base ring Z_{2^13} is not a field and thus not directly amenable to application of the NTT. Accordingly, the specification samples the public matrix A in the polynomial domain.
Parameters. The module dimension ℓ, the rounding parameter T, and the secret distribution parameter η vary according to the parameter sets LightSaber, Saber, and FireSaber (respectively targeting the NIST security levels 1, 3, and 5, cf. Table 1).

Dilithium
Dilithium [ABD+20a] is a NISTPQC [NIS] finalist digital signature scheme based on the M-SIS (Module Small Integer Solutions) and M-LWE problems. The module is of dimension k × ℓ over the ring R_q = F_q[x]/(x^256 + 1) with q = 2^23 − 2^13 + 1 = 8380417.
For algorithmic descriptions see [ABD+20a]. The core operation of key generation, signature generation, and signature verification is the (k × ℓ) × (ℓ × 1) matrix-to-vector polynomial multiplication A·s_1, A·y, and A·z (resp.) (MatrixVectorMul). In signature generation, this operation is particularly time-consuming since it is executed in a loop. Similar to Kyber, Dilithium builds a (complete) NTT into the specification, i.e., A is sampled in NTT domain using the XOF SHAKE256.

Parameters. Dilithium has parameter sets Dilithium2, Dilithium3, and Dilithium5 targeting the corresponding NIST security levels. For all parameter sets, γ_1 = (q − 1)/16 = 523776 and γ_2 = γ_1/2 = 261888. For each parameter set, the remaining parameters are given in Table 2. The parameters consist of the matrix dimension (k, ℓ), the sampling bound of the secret η, and the rejection thresholds β and ω.

Modular arithmetic
In this section, we establish some basic facts and notation about modular arithmetic which we will use throughout the paper.
Notation. We will denote by ⌊·⌉ any "integer approximation", by which we mean a function ⌊·⌉ : Q → Z with |z − ⌊z⌉| ≤ 1 for all z ∈ Q. Examples include the rounding functions ⌊·⌋, ⌈·⌉, and round-to-nearest, but also e.g. the 2Z-valued ⌊z⌉_2 := 2⌊z/2⌉, which will be of special interest later. Note that we do not require ⌊z⌉ = z for all z ∈ Z, and this does in fact not hold for ⌊·⌉_2. Let N ∈ N, henceforth called the modulus; in our application we will have either N = 2^k or N = q an odd prime. We denote by Z_N := Z/NZ (or F_q if N = q is a prime) the quotient ring of Z by the equivalence relation x ≡_N y, identifying two integers which leave the same residue after division by N. For z ∈ Z, we denote by z̄ ∈ Z_N its residue class; in the case of a 2-power N, we use the notation ẑ instead.
The canonical signed and unsigned representatives are related to the integer approximations round-to-nearest and ⌊·⌋ via z mod± N = z − N·⌊z/N⌉ and z mod+ N = z − N·⌊z/N⌋. More generally, we can define z mod N := z − N·⌊z/N⌉ for any integer approximation ⌊·⌉.
Example 1. We consider z mod_2 N, associated with ⌊z⌉_2 = 2⌊z/2⌉. In this case, z mod_2 N is a representative of z modulo N within {−N, …, N} of the same parity as z. For example, if z is even, we have z mod_2 N = z mod± N if z mod± N is even, and z mod_2 N = z mod± N − sign(z mod± N)·N otherwise.
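A small Python check of this example; the helper rnd (round-to-nearest, ties away from zero) and the modulus M = 7 are illustrative choices of ours.

```python
# The even approximation ⌊z/M⌉₂ := 2·⌊z/(2M)⌉ and the induced z mod₂ M:
# the result lies in {−M, …, M} and has the same parity as z.
def rnd(num, den):               # round-to-nearest (ties away from zero)
    q, r = divmod(num, den)
    return q + (1 if 2 * r >= den else 0)

def mod2(z, M):
    return z - M * (2 * rnd(z, 2 * M))

M = 7
for z in range(-30, 31):
    r = mod2(z, M)
    assert (z - r) % M == 0      # a representative of z mod M
    assert -M <= r <= M          # within {−M, …, M}
    assert (z - r) % 2 == 0      # same parity as z
```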

If ⌊·⌉ satisfies ⌊z + 1⌉ = ⌊z⌉ + 1, then z mod N descends to a function Z_N → Z; this is the case for ⌊·⌋ and round-to-nearest, and we will accordingly also use z mod± N and z mod+ N for z ∈ Z_N. We will need the following fact:

Fact 1. Let R = 2^n and x ∈ Z_R. Then the following holds:

Beyond those canonical representatives, we will also be interested in non-unique representatives from intervals {−N′, −N′+1, …, N′−1, N′}, where N′ is suitably bounded with respect to N. By "algorithms for arithmetic modulo N" we mean algorithms which compute addition and multiplication in Z_N in terms of "small" representatives within those sets. The primary objective in designing such algorithms is the avoidance of generic integer division; a secondary goal is to avoid unnecessary reductions. Our main techniques are Montgomery and Barrett reduction, which we recall briefly now.

Barrett reduction
We have z mod± N = z − N·⌊z/N⌉. The idea behind Barrett reduction [Bar86] is to approximate ⌊z/N⌉ by ⌊z·⌊R/N⌉/R⌉, where ⌊·⌉ is a choice of integer approximation and R = 2^n > N is fixed. Since ⌊R/N⌉ can be precomputed, this provides an approximation to z mod± N relying solely on division by R = 2^n, which common hardware can realize as a bitshift. We denote the resulting approximation to z mod± N by bar_N(z), or bar±_N(z) if ⌊·⌉ is round-to-nearest. See Algorithm 1.
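As a sanity check, the following Python sketch mirrors Algorithm 1; the parameters N = 3329 (Kyber's prime) and R = 2^16, and the round-to-nearest helper rnd, are example choices of ours.

```python
# Sketch of signed Barrett reduction: the only divisions are by R = 2^16.
N, n = 3329, 16
R = 1 << n

def rnd(num, den):               # round-to-nearest (ties away from zero)
    q, r = divmod(num, den)
    return q + (1 if 2 * r >= den else 0)

RN = rnd(R, N)                   # precomputed ⌊R/N⌉

def barrett_reduce(z):
    t = rnd(z * RN, R)           # approximates ⌊z/N⌉ using a division by R only
    return z - N * t             # representative of z mod N, |result| < (3/2)N

z = 12345
assert barrett_reduce(z) % N == z % N
assert abs(barrett_reduce(z)) < 3 * N // 2
```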

Algorithm 2 Montgomery reduction
For any choice of integer approximation ⌊·⌉, we have a bound on |bar_N(z)|; in particular, if |z| ≤ R, we have |bar_N(z)| < (3/2)N.

Proof. The left-hand side is equal to N·|z/N − ⌊z·⌊R/N⌉/R⌉|, which is bounded using |x − ⌊x⌉| ≤ 1 and the quality of the precomputed approximation ⌊R/N⌉.

Implementation considerations.
In the context of Algorithm 1, the result has absolute value |c − t| < (3/2)N. If N < R/3, then c − t is a signed canonical representative modulo R = 2^n, and thus uniquely determined by its residue modulo 2^n. This observation allows us to perform steps 2 and 3 of Algorithm 1 in single-width arithmetic, leading to a presentation of Barrett reduction in terms of single-width operations alone (Algorithm 3).

Algorithm 3 Barrett reduction, implementation-view
Canonical signed representative

Montgomery reduction
Like Barrett reduction, Montgomery reduction [Mon85] provides a way to trade expensive division by N for cheap division by a power of two. The idea is simple: Assume we would like to reduce z ∈ Z with respect to the odd modulus N, and that z happens to be a multiple of R = 2^n. Then the (cheaply computed) integer division z/R is a representative of z̄·R^{−1} ∈ Z_N which is shorter than z by n bits. In other words, if we accept "twisting" our target residue z̄ by some factor R^{−1} ∈ Z_N, we can shorten its representative. If the given representative z ∈ Z is not evenly divisible by R, we can turn it into one through the following correction step: We need to find some k such that z − kN is divisible by R, that is, ẑ = k̂N̂ in Z_R. Since N and R are coprime, this is achieved by taking k to be a small representative of ẑ·N̂^{−1} in Z_R; we will always choose z·N^{−1} mod± R.
The so-obtained "Montgomery reduction" is summarized in Algorithm 2 and denoted mont+_N(z), with R always being implicit. Algorithm 2 also shows a close variant mont−_N(z) which implements the correction step through an addition z + kN instead of a subtraction: in this case, we want k to be a small representative of −ẑ·N̂^{−1} in Z_R, for which we choose −z·N^{−1} mod± R. We will call this variant "negative" Montgomery reduction. Both compute a representative of z̄·R^{−1} ∈ Z_N, so they can differ only by a multiple of N. Note that we only talk about smallness of representatives of elements of Z_N, not of the elements of Z_N themselves. In particular, one should not consider the multiplication by R^{−1} as some form of scaling, but instead as an abstract permutation of Z_N.
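A Python sketch of both variants of Algorithm 2; N = 3329 and R = 2^16 are example choices of ours, and mod_pm models the signed canonical representative mod± R.

```python
# Montgomery reduction mont+ and its "negative" variant mont−.
N, n = 3329, 16
R = 1 << n
Ninv = pow(N, -1, R)             # N^{-1} mod R (N odd, so invertible)

def mod_pm(x, M):                # signed canonical representative mod M
    r = x % M
    return r - M if r > M // 2 else r

def mont_plus(z):
    k = mod_pm(z * Ninv % R, R)  # k ≡ z·N^{-1} (mod R)
    return (z - k * N) >> n      # exact division: represents z·R^{-1} mod N

def mont_minus(z):
    k = mod_pm(-z * Ninv % R, R) # k ≡ −z·N^{-1} (mod R)
    return (z + k * N) >> n      # correction by addition instead of subtraction

z = 123456
assert (mont_plus(z) * R - z) % N == 0
assert (mont_minus(z) * R - z) % N == 0
```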

Algorithm 4 Montgomery reduction, implementation-view
1: k ← z · N^{−1} mod+ R            ▷ Unsigned single-width multiply
2: k ← k mod± R                     ▷ Canonical signed representative
3: c ← ⌊kN/R⌋                       ▷ Signed multiply-high with truncation
4: z_high ← ⌊z/R⌋                   ▷ High part extraction
5: mont+_N(z) ← z_high − c          ▷ Signed subtraction/addition

Algorithm 5 Montgomery multiplication, implementation-view
Proof. This follows from Fact 1 and the fact that 2^{n−1} · x ≡ 2^{n−1} (mod 2^n) for odd x.
Remark 1. The exceptional case ẑ = 2^{n−1} can be made explicit.

Implementation considerations. We briefly summarize known implementation aspects of Montgomery reduction. Firstly, in the context of Algorithm 2 for mont+_N(z): since z ≡ kN (mod R), the low halves of z and kN agree, and hence (z − kN)/R = ⌊z/R⌋ − ⌊kN/R⌋. This allows us to rewrite the algorithm in terms of single-width operations, as detailed in Algorithm 4. This description does not apply to mont−_N(z): writing c = kN, the addition z + c will usually introduce a carry-in from the low half to the high half, and this carry is lost in ⌊z/R⌋ + ⌊c/R⌋. We will revisit this later. Secondly, consider the use of Montgomery reduction for modular multiplication by constants: if z = ab for single-width values a, b ∈ Z, and b is known in advance, then b·N^{−1} mod R can be precomputed, leading to Algorithm 5, Montgomery reduction for products (a.k.a. "Montgomery multiplication") with one known factor. Finally, for microarchitectures with length-doubling ("long") and non-doubling products, the optimal implementation varies depending on what instructions are available. For the Neon instruction set, please refer to Section 3.2.5 and Algorithm 14.
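A single-lane Python model of Algorithm 5, with N = 3329 and R = 2^16 as example choices of ours and the factor b fixed upfront.

```python
# Montgomery multiplication with one known factor: b·N^{-1} mod± R is
# precomputed once per constant b.
N, n = 3329, 16
R = 1 << n

def mod_pm(x, M):                # signed canonical representative mod M
    r = x % M
    return r - M if r > M // 2 else r

b = 1234                                     # known factor
bNinv = mod_pm(b * pow(N, -1, R) % R, R)     # precomputed b·N^{-1} mod± R

def mont_mul_known(a):
    k = mod_pm(a * bNinv % R, R) # k ≡ ab·N^{-1} (mod R)
    return (a * b - k * N) >> n  # (ab − kN)/R ≡ ab·R^{-1} (mod N)

a = -567
assert (mont_mul_known(a) * R - a * b) % N == 0
```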

Fixed point arithmetic
Let n ∈ {16, 32}. The fixed-point interpretation of a single-width signed integer a ∈ {−2^{n−1}, −2^{n−1}+1, …, 2^{n−1}−1} is the single-precision fractional value a/2^{n−1} ∈ [−1, +1). Likewise, the fixed-point interpretation of a double-width signed integer a is the double-precision fractional value a/2^{2n−1}. In this interpretation, the double-precision product of two single-precision fractional values corresponds to a doubling long multiplication on the corresponding signed integers: (a/2^{n−1})·(b/2^{n−1}) = 2ab/2^{2n−1}. For this reason, doubling long multiplications are found in some ISAs supporting fixed-point arithmetic: Neon offers SQDMULL, Helium offers VQDMULL, and SVE2 offers SQDMULLB/SQDMULLT.
Similarly, single-precision approximations to the product of two single-precision fractional values correspond to a doubling long multiplication returning the high half, i.e. (a, b) → ⌊2ab/2^n⌋, or its rounding variant (a, b) → ⌊2ab/2^n⌉; in Neon, these are SQDMULH and SQRDMULH.

(Figure: a Cooley-Tukey (CT) butterfly.)

Number Theoretic Transform
In Kyber, Saber, and Dilithium, we need arithmetic in R_q = Z_q[x]/(x^256 + 1), a polynomial ring. Here q = 3329 = 13·2^8 + 1 for Kyber, q = 2^13 for Saber, and q = 2^23 − 2^13 + 1 = 8380417 for Dilithium. More specifically, we need to perform matrix-to-vector multiplications and inner products. All three will be developed in parallel after we switch to a prime modulus q′ for Saber [CHK+21], chosen so that the results are the same as for a computation over Z[x]/(x^256 + 1).
Let F = F_q for Dilithium and Kyber and F = F_{q′} for Saber. From the CRT (Chinese Remainder Theorem) we have the ring isomorphism F[x]/(x^{2n} − c^2) ≅ F[x]/(x^n − c) × F[x]/(x^n + c) (Figure 1). To split a 2n-long polynomial, we perform CT butterflies between n pairs of coefficients, each from one half of the polynomial. Everything can be done in place, and the CT butterfly has an obvious inverse, the GS butterfly, except for a factor of 2. For even n and ±c with square roots in F, we can recurse this process, each stage comprising the same number of butterflies. Let ζ be a principal root of unity of order 2^k, so that ζ^{2^{k−1}} = −1. An (incomplete) NTT is defined as the series of ring isomorphisms computed in this way. Here k = 7, d = 2, c^{128} = −1 for Kyber; k = 8, d = 1, c^{256} = −1 for Dilithium; and ζ = c^2 for both, as the NTT is specified. Saber is flexible and only requires d·2^k = 256 and c^{2^{k−1}} = −1. The image of a polynomial f under the ring isomorphism is called "the NTT" of f and denoted NTT(f). A polynomial product fg in the ring is computed as NTT^{−1}(NTT(f) ∘ NTT(g)), where ∘ denotes multiplication in the image ring.
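To make the construction concrete, here is a toy complete negacyclic NTT in Python over Z_17[x]/(x^4 + 1); the parameters (q = 17, n = 4, ζ = 2, with 2^4 ≡ −1 mod 17) are illustrative choices of ours, far smaller than the schemes' parameters.

```python
# Toy negacyclic NTT: evaluating at the odd powers of an 8th root of unity
# turns multiplication mod x^4 + 1 into pointwise multiplication.
q, nn, zeta = 17, 4, 2

def ntt(f):       # evaluate f at zeta^(2i+1), the roots of x^4 + 1
    return [sum(c * pow(zeta, (2 * i + 1) * j, q) for j, c in enumerate(f)) % q
            for i in range(nn)]

def intt(fhat):   # inverse transform (interpolation), scaled by n^{-1}
    ninv = pow(nn, -1, q)
    return [ninv * sum(fc * pow(zeta, -(2 * i + 1) * j, q)
                       for i, fc in enumerate(fhat)) % q
            for j in range(nn)]

def negacyclic(f, g):            # schoolbook product mod x^4 + 1
    r = [0] * nn
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            k = i + j
            r[k % nn] = (r[k % nn] + (a * b if k < nn else -a * b)) % q
    return r

f, g = [1, 2, 3, 4], [5, 6, 7, 8]
pointwise = [(a * b) % q for a, b in zip(ntt(f), ntt(g))]
assert intt(pointwise) == negacyclic(f, g)
```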

Modular Multiplication
In this section, we present improvements to Barrett reduction and Montgomery multiplication from a theoretical and an implementation perspective: In Section 3.1, we present a relation between Barrett reduction and Montgomery reduction and expand it to a relation between Montgomery multiplication and a new variant of modular multiplication via Barrett reduction which we call "Barrett multiplication". We also introduce two new variants of Montgomery multiplication using fixed-point arithmetic. In Section 3.2, we study implementations on SIMD extensions, focusing on the Neon instruction set. Concretely, we find a 3-instruction sequence for SIMD modular multiplication by known constants, improving on the 4-instruction sequence introduced by [Sei18, LS19], and a 5-instruction sequence for modular multiplication of unknown inputs on the Neon instruction set, improving on the sequence used in [NG21]. We also describe a 4-instruction sequence for two unknown inputs; it does, however, impose a parity constraint on the inputs, which may or may not be satisfiable depending on the application context.

Barrett reduction vs. Montgomery reduction
In this section, we compare Barrett and Montgomery reduction for single-width values.

Proposition 1. Let N be odd, R = 2^n > N and z ∈ Z. Then Barrett reduction and negative Montgomery reduction satisfy the following relation:

bar±_N(z) = mont−_N(z · (R mod± N)).    (1)

More generally, for an arbitrary approximation ⌊·⌉, we have bar_N(z) = mont−_N(z · (R mod N)).

Remark 2. Since mont−_N(x) computes a representative of x̄·R^{−1} in Z_N, the factor R mod± N in (1) does not come as a surprise. It also explains why bar±_N(z) and mont−_N(z(R mod± N)) can only differ by a multiple of N. The value of Proposition 1 lies in working with explicit representatives.
Remark 3. It follows from Fact 3 that Proposition 1 also holds for mont+_N, except when z·(R mod± N) ≡ 2^{n−1} (mod R), in which case bar±_N(z) = mont+_N(z(R mod± N)) − N.
Lemma 1. In the context of Proposition 1, we have ⌊R/N⌉ ≡ −(R mod N)·N^{−1} (mod R).    (2)

Proof of Lemma 1. We have N·⌊R/N⌉ = R − (R mod N) in Z. Passing to Z_R and multiplying by N^{−1} proves the claim; Equation (2) follows by applying _ mod± R : Z_R → Z.
Proof of Proposition 1. The proof is mostly careful unraveling of definitions. Note that the right-hand side of the defining expression for bar_N(z) already matches the correction term occurring in mont−_N(z(R mod N)). Next, we replace ⌊R/N⌉ by (R − (R mod N))/N and obtain the claimed equality with mont−_N(z(R mod N)).

Corollary 1.
In the context of Proposition 1, the bound for Montgomery reduction carries over to bar_N. In particular, if |z| < N and N < R/2, then |bar_N(z)| < N.
Proof. The first part follows from Proposition 1 and the bound for Montgomery reduction. For the second, recall |_ mod N| ≤ N.

Barrett multiplication
The relation between Barrett reduction and Montgomery reduction exhibited in the previous section raises the question of whether there is a natural extension to Montgomery multiplication, and what the analog on the Barrett side is. The answer turns out to be a variant of Barrett reduction for multiplication with known constants, which we describe in this section. While very natural in retrospect, the definitions and results are novel to the best of our knowledge. Recall (Section 2.4.1) the idea of Barrett reduction: To reduce z ∈ Z, we approximate ⌊z/N⌉ by ⌊z·⌊R/N⌉/R⌉, replacing R/N by the precomputed ⌊R/N⌉. If z = ab for a, b ∈ Z with b a known constant, we can improve the quality of the approximation by pulling b into the precomputation, approximating ab/N ≈ a·⌊bR/N⌉/R, where ⌊bR/N⌉ can be precomputed. We call this the "Barrett multiplication" of a and b and denote it bar_N(a, b), or bar±_N(a, b) if ⌊·⌉ is round-to-nearest. As before, the choice of R is implicit in the notation and to be understood from the context. We describe Barrett multiplication in Algorithm 6.
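A Python sketch of Barrett multiplication with a known factor b; N = 3329, R = 2^16, and the round-to-nearest helper rnd are example choices of ours.

```python
# Barrett multiplication: pull the known factor b into the precomputation.
N, n = 3329, 16
R = 1 << n

def rnd(num, den):               # round-to-nearest (ties away from zero)
    q, r = divmod(num, den)
    return q + (1 if 2 * r >= den else 0)

b = 2020                          # known factor
bRN = rnd(b * R, N)               # precomputed ⌊bR/N⌉

def barrett_mul(a):
    t = rnd(a * bRN, R)           # approximates ab/N
    return a * b - N * t          # ≡ ab (mod N)

a = 1000                          # |a| < N and N < R/2, so |result| < N
z = barrett_mul(a)
assert (z - a * b) % N == 0 and abs(z) < N
```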

Algorithm 7 Montgomery multiplication via doubling, abstract view
Require: N odd modulus, R = 2^n > N. Require: a, b ∈ Z representatives mod N with |a|, |b| < R.
Algorithm 6 Barrett multiplication, abstract view

We now obtain the desired analog of Proposition 1 for Barrett multiplication:

Proposition 2. Let N be odd, R = 2^n > N and a, b ∈ Z. Then Barrett multiplication and negative Montgomery multiplication satisfy the following relation:

bar±_N(a, b) = mont−_N(a · (bR mod± N)).

More generally, for an arbitrary approximation ⌊·⌉, we have bar_N(a, b) = mont−_N(a · (bR mod N)).

Proof. Replace R mod N by bR mod N in the proof of Proposition 1.

Corollary 2.
In the context of Proposition 2, the bound of Corollary 1 carries over. In particular, for |a| < N and N < R/2 we have |bar_N(a, b)| < N.

Montgomery multiplication via doubling
Because of the doubling inherent in many fixed-point instructions, the variant of Montgomery multiplication described in Algorithm 7 will be useful.
Proposition 3. Algorithm 7 is correct.

Proof. By construction, ab − kN is divisible by R. Hence, so is 2ab − 2kN, and we get (2ab − 2kN)/(2R) = (ab − kN)/R. Algorithm 7 thus yields the same result as Montgomery multiplication.

Algorithm 8 Montgomery multiplication via rounding, abstract view
Require: N odd modulus, R = 2^n > N. Require: a, b ∈ Z representatives mod N with |a|, |b| < R/2, s.t. a or b is odd.

Montgomery multiplication via rounding
As first pointed out by Seiler in [Sei18] and mentioned in Section 2.4.2, the single-width Montgomery multiplication of Algorithm 5 breaks when trying to use an addition instead of a subtraction in the final step, because of a carry-in from the low half to the high half. Algorithm 8 presents a remedy using rounding. Note the assumption that one of the inputs is odd.

Proposition 4. Algorithm 8 is correct.
Proof. Assuming ab ≢ 2^{n−1} (mod R), Fact 1 implies ab mod± R = −(kN mod± R), and hence the result of Algorithm 8 equals that of ordinary Montgomery multiplication. It remains to justify why ab ≢ 2^{n−1} (mod R), which involves the assumption that either a or b is odd: If, say, b is odd, then ab ≡ 2^{n−1} (mod R) implies a ≡ 2^{n−1} (mod R), hence |a| ≥ 2^{n−1} = R/2, contradicting the assumption that |a|, |b| < R/2.
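To see the rounding-based variant on concrete numbers, the following Python sketch models it on a 16-bit lane; N = 3329 is an example choice of ours, sqrdmulh models a rounding doubling multiply-high (saturation omitted), and the correction uses −N^{-1} so that it is applied by addition.

```python
# Montgomery multiplication via rounding: the two rounded high halves cancel
# exactly (for one odd factor), so no low-half carry is lost.
N, n = 3329, 16
R = 1 << n
negNinv = (-pow(N, -1, R)) % R   # −N^{-1} mod R

def to_i16(x):                   # wrap to a signed 16-bit lane value
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x

def sqrdmulh(a, b):              # ⌊2ab/2^16⌉, rounding to nearest
    return (2 * a * b + (1 << (n - 1))) >> n

def mont_mul_rounding(a, b):
    k = to_i16(to_i16(a * b) * negNinv)   # k ≡ −ab·N^{-1} (mod R)
    s = sqrdmulh(a, b) + sqrdmulh(k, N)   # rounded halves; low parts cancel
    return s >> 1                          # (ab + kN)/R ≡ ab·R^{-1} (mod N)

a, b = 1000, 2021                # b odd, |a|, |b| < R/2
z = mont_mul_rounding(a, b)
assert (z * R - a * b) % N == 0
```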

Implementation
In this section, we look at Barrett and Montgomery multiplication from an implementation perspective, focusing on the Armv8-A version of the Neon SIMD instructions.

Barrett multiplication in 3-instructions
Analogously to Algorithm 3, Barrett multiplication can be described in terms of single-width operations. The details are spelled out in Algorithm 9. We see that Barrett multiplication can be expressed in terms of 3 single-width operations: 1× unsigned multiply-low, 1× unsigned multiply-low-accumulate, 1× multiply-high-with-rounding.

Algorithm 10 Barrett multiplication in Neon
Require: a ∈ Z, |a| ≤ R, to be multiplied by the known constant b.
Ensure: z ≡ ab (mod N), |z| < (3/2)N < R/2.
1: mul z, a, b
2: sqrdmulh t, a, ⌊bR/N⌉_2/2
3: mls z, t, N

In Neon, multiply-low and multiply-low-accumulate/-subtract are implemented via MUL and MLA/MLS, respectively. The multiply-high-with-rounding operation (a, b) → ⌊ab/2^n⌉ does not have an exact match in Neon but, as explained in Section 2.5, there is SQRDMULH, which computes (a, b) → ⌊2ab/2^n⌉ instead. We work around this difference by choosing the even integer approximation ⌊·⌉_2 as the basis for Barrett multiplication: in this case, we have ⌊a·⌊bR/N⌉_2/R⌉ = ⌊2a·(⌊bR/N⌉_2/2)/R⌉, which can be implemented through SQRDMULH since ⌊bR/N⌉_2/2 is an integer that can be computed upfront. We show the resulting Neon sequence in Algorithm 10. Note, however, that Barrett multiplication requires one factor to be known upfront. It does not apply to the "point multiplication" of two unknown values.
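The three-instruction sequence can be checked in Python by emulating each instruction on one signed 16-bit lane; N = 3329 and the inputs are example choices of ours, and the models of MUL/MLS/SQRDMULH follow their documented lane semantics (MUL/MLS wrap, SQRDMULH rounds; saturation omitted).

```python
# Emulation of the mul / sqrdmulh / mls sequence of Algorithm 10.
N, n = 3329, 16
R = 1 << n

def rnd(num, den):               # round-to-nearest (ties away from zero)
    q, r = divmod(num, den)
    return q + (1 if 2 * r >= den else 0)

def to_i16(x):                   # wrap to signed 16-bit, as MUL/MLS lanes do
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x

def sqrdmulh(a, b):              # ⌊2ab/2^16⌉ with rounding
    return (2 * a * b + (1 << (n - 1))) >> n

def barrett_mul_neon(a, b, half_const):
    z = to_i16(a * b)                    # mul  z, a, b      (low 16 bits)
    t = sqrdmulh(a, half_const)          # sqrdmulh t, a, ⌊bR/N⌉₂/2
    return to_i16(z - to_i16(t * N))     # mls  z, t, N

b = 1234                                 # known factor
bRN2 = 2 * rnd(b * R, 2 * N)             # even approximation ⌊bR/N⌉₂
a = -3000
z = barrett_mul_neon(a, b, bRN2 // 2)
assert (z - a * b) % N == 0              # wraparound is harmless: |z| < R/2
assert abs(z) < 3 * N // 2 + 1
```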
Remark 4. This strategy works for 16- and 32-bit moduli and any SIMD ISA which offers a doubling multiply-high-with-rounding instruction. This includes the M-Profile Vector Extension (MVE), the Scalable Vector Extension 2 (SVE2), and (for 16-bit lanes only) AVX2.

Barrett reduction
We comment on the implementation of Barrett reduction (Algorithm 3) in the Neon instruction set. For Barrett reduction of a single-width value, we choose R as large as possible in Algorithm 3 so that ⌊R/N⌉ is still a single-width signed value, increasing the precision of the approximation.

Montgomery multiplication via doubling
Ordinary Montgomery multiplication does not lend itself to a straightforward implementation in Neon because Neon does not offer a multiply-high instruction. This is the reason why e.g. [NG21] uses the long multiply UMULL to implement Montgomery multiplication. We propose instead to use the doubling multiply-high instruction SQDMULH to implement Algorithm 7. Moreover, the final step mont±_N(z) ← (z − c)/2 can be implemented via the halving subtract instruction SHSUB.
The resulting Neon sequence is shown in Algorithm 12. It provides a 5-instruction sequence for Montgomery multiplication of two unknown values, and a 4-instruction sequence if one factor is a constant.
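On a single lane, the doubling sequence can be modeled in Python as follows; N = 3329 and the inputs are example choices of ours, and sqdmulh/shsub model the corresponding Neon instructions (saturation omitted).

```python
# Doubling Montgomery multiplication: SQDMULH produces ⌊2·x·y/R⌋, and the
# halving subtract SHSUB removes the factor of 2 at the end.
N, n = 3329, 16
R = 1 << n
Ninv = pow(N, -1, R)

def to_i16(x):
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x

def sqdmulh(a, b):               # doubling multiply, high half: ⌊2ab/2^16⌋
    return (2 * a * b) >> n

def shsub(a, b):                 # signed halving subtract: ⌊(a − b)/2⌋
    return (a - b) >> 1

def mont_mul_doubling(a, b):
    t = sqdmulh(a, b)                     # ⌊2ab/R⌋
    k = to_i16(to_i16(a * b) * Ninv)      # k ≡ ab·N^{-1} (mod R), low halves
    c = sqdmulh(k, N)                     # ⌊2kN/R⌋
    return shsub(t, c)                    # (ab − kN)/R ≡ ab·R^{-1} (mod N)

a, b = 1234, -567
assert (mont_mul_doubling(a, b) * R - a * b) % N == 0
```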

Montgomery multiplication via rounding
It is natural to try to merge the multiply-high and subtract steps in Montgomery multiplication to shorten the modular multiplication sequence further. However, there is no plain multiply-high-accumulate instruction. Instead, MVE, SVE2, and the Armv8.1-A extension to Neon provide a multiply-high-accumulate-with-rounding instruction, which lends itself to an implementation of Algorithm 8 as shown in Algorithm 13. Like Barrett multiplication, this provides a 3-instruction sequence for modular multiplication with a known constant; unlike Barrett multiplication, however, it also applies to two unknown factors, provided one of them is known to be odd. While this is a strong condition, one has some leverage to reason about parity since the result of Algorithm 13 is always even. Note also that since sqrdmlah does not perform halving, Algorithm 13 computes a representative of 2abR^{−1} instead of abR^{−1}. Steps 3–7 perform a better (l, h) → c Montgomery reduction than [NG21, Alg. 12, Steps 3–9].

We have so far focused on single-width implementations of Montgomery and Barrett multiplication. Those implementations, however, do not lend themselves well to applications which need to compute sums of modular multiplications: in this case, it is natural to reduce only once after the accumulation, rather than once after every product. Single-width modular multiplication cannot achieve this because it misses the carry-in between low and high parts. Instead, we need Montgomery multiplication using long products. We show an implementation of Montgomery multiplication in long arithmetic in Algorithm 14. After computing long pairwise products from two vectors via smull and smull2, we perform a Montgomery reduction: we take the lower halves (conveniently, with uzp1) and multiply them by the inverse of the modulus with mul. We then multiply by the modulus and accumulate onto the long products with smlal and smlal2. Finally, we collect the Montgomery results from the top halves with uzp2.
The result is exactly as desired, and, in the setting of Algorithm 14, we can accumulate several products before the Montgomery reduction (cf. Section 4.2).
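The accumulate-then-reduce flow can be sketched in Python as follows; N = 3329 is an example modulus of ours, and multiplying the low half by −N^{-1} mod R (an assumption of this sketch) makes the low half cancel under the subsequent addition, mirroring the smull/smlal + uzp1/uzp2 flow.

```python
# Accumulate several 16×16-bit products in a double-width accumulator,
# then perform a single Montgomery reduction.
N, n = 3329, 16
R = 1 << n
negNinv = (-pow(N, -1, R)) % R   # −N^{-1} mod R, applied to the low half

def mont_reduce_long(z):         # z is a double-width value
    k = (z * negNinv) % R        # low-half multiply: k ≡ −z·N^{-1} (mod R)
    return (z + k * N) >> n      # low half cancels; high half ≡ z·R^{-1} mod N

pairs = [(1234, 567), (-890, 321), (45, -678)]
acc = sum(a * b for a, b in pairs)           # accumulate before reducing
z = mont_reduce_long(acc)
assert (z * R - acc) % N == 0
```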

Implementation
In this section, we fix F = F_q for Dilithium and Kyber, and F = F_{q′} for Saber. Section 4 is organized as follows: In Section 4.1, we describe our choices of butterflies for the NTTs. Section 4.2 introduces the asymmetric multiplication applicable to the MatrixVectorMul in Kyber and Saber. Section 4.3 describes our findings on how interleaving can be applied to radix-2 NTTs on the Cortex-A72.

Butterflies
In this section, we adopt a more architectural viewpoint of radix-2 NTTs.
The need for permutations in the NTTs. For vectorized implementations, an important consideration is the overhead of permuting data within vector registers. This is required when the distance between the inputs of a butterfly is less than the size of a SIMD register. First, we notice that since each SIMD register in Neon holds 16 bytes, any butterfly taking inputs at a distance of at least 16 bytes does not require any permutation. Now we look carefully at the NTTs at each layer, numbering the layers from 0. After each layer, the distance between the butterfly inputs is halved. For Dilithium and Saber, we compute 32-bit NTTs for degree-255 polynomials. At the kth layer, we compute butterflies with inputs at distance 4·2^{7−k} = 512/2^k bytes. Therefore, after the 5th layer, any follow-up layers require permutations. As for Kyber, with 16-bit NTTs for degree-255 polynomials, it is after the 4th layer that one needs permutations for the follow-up layers.
{ld, st}{1, 2, 3, 4} and trn{1, 2}. ld{2, 3, 4} load with the indicated degree of interleaving: an array of structures with 2 (3, 4) elements is loaded into 2 (3, 4) SIMD registers, where the first lane of each register holds the first structure, and so on. ld1, on the other hand, simply loads consecutively into 1 to 4 SIMD registers. st{1, 2, 3, 4} are their counterparts for storing structures. Besides shuffling with memory operations, we have trn{1, 2} for permuting the elements of SIMD registers. trn1 (trn2) moves the even (odd) indices of the first source to the even indices of the destination and the even (odd) indices of the second source to the odd indices of the destination.
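The trn1/trn2 semantics just described can be modeled on short lane lists in a few lines of Python (an illustrative model of ours, not Arm code):

```python
# Model of the Neon trn1/trn2 permutations on equal-length lane lists.
def trn1(a, b):                   # even lanes of a and b, interleaved
    return [a[i] if i % 2 == 0 else b[i - 1] for i in range(len(a))]

def trn2(a, b):                   # odd lanes of a and b, interleaved
    return [a[i + 1] if i % 2 == 0 else b[i] for i in range(len(a))]

a, b = [0, 1, 2, 3], [4, 5, 6, 7]
assert trn1(a, b) == [0, 4, 2, 6]
assert trn2(a, b) == [1, 5, 3, 7]
```

Applied to pairs of registers, trn1/trn2 together realize the 2×2 transposes used in the butterfly layers below.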
Saber. First of all, since Armv8-A provides instructions for both 16-bit and 32-bit arithmetic, we find no reason to employ 16-bit NTTs as in [NG21]. We decide to implement 32-bit NTTs as implemented for the Cortex-M4 in [CHK+21]. For the NTT, we compute over F[x]/(x^256 + 1) with 6 layers of CT butterflies. On the other hand, we compute the NTT^{−1} for F[x]/(x^256 − 1) with 6 layers of CT butterflies and then map F[x]/(x^256 − 1) to F[x]/(x^256 + 1). Recall that at the end of the NTT^{−1}, we have to multiply each coefficient by 2^{−k}, so we can multiply each chunk of 4 coefficients by the precomputed (2^{−k}, 2^{−k}ζ^{−1}, …, 2^{−k}ζ^{−63}).

Dilithium.
We implement the NTT with CT butterflies and NTT^−1 with GS butterflies. For the NTT, after 4 radix-2 splits, the distance between the butterfly inputs for the next layer is 32 bytes. For the next 4 layers, we first load with ld1 and apply two layers of radix-2 splits on four SIMD registers as usual. Next, we transpose the four registers with trn{1, 2}. At the end of the butterflies, we can store with st4 to cancel out the transpose. For NTT^−1, we invert the entire process: for the initial 4 layers of NTT^−1, we load with ld4, compute two layers with GS butterflies, transpose with trn{1, 2}, compute two layers with GS butterflies, and store with st1. For the last 4 layers, we invert the CT butterflies with GS butterflies and merge half of the multiplications by 2^−8 with ζ^−128.
Kyber. We also implement the NTT with CT butterflies and NTT^−1 with GS butterflies for Kyber. For the NTT, after 4 radix-2 splits, the distance between the butterfly inputs for the next layer is 16 bytes. We proceed for the bottom 3 layers with a different permutation. First, we ld4 in 4-byte units into 8 SIMD registers. Next, we use trn1 and trn2 to separate the upper 4 bytes from the lower 4 bytes for the butterflies. In this way, we can then apply 3 layers of butterflies without intermediate permutations, which would otherwise prohibit an aggressive interleaving of instructions. We will discuss the interleaving in Section 4.3. For NTT^−1, we invert the entire NTT and also merge half of the multiplications by 2^−7 with ζ^−64. Table 3 summarizes the butterflies for the NTT and NTT^−1.

Asymmetric Multiplication
We introduce an idea that we call asymmetric multiplication for improving the efficiency of matrix-to-vector polynomial multiplication based on incomplete NTTs, and in general whenever incomplete NTTs are cached. It is thus applicable to Kyber and Saber, but not to Dilithium. Recall that for Kyber and Saber, we can compute As as NTT^−1(NTT(A) ◦ NTT(s)), where ◦ denotes the coefficientwise multiplication of polynomial chunks modulo x^k − ω, with k = 2 and 4 for Kyber and Saber, respectively. As with any matrix-to-vector product, every entry of NTT(s) will be multiplied with entries from NTT(A). It is therefore beneficial to cache NTT(s), which is indeed a known and common optimization technique for NTT-based multiplication. Asymmetric multiplication extends this observation by caching more computations on the multiplicands of one side.
Specifically, note that when computing the product of a = Σ_i a_i x^i and s = Σ_i s_i x^i in F[x]/(x^k − ω), we need to compute and sum a_i s_j for i + j < k and ω a_i s_j = a_i (ω s_j) for i + j ≥ k. It is therefore beneficial to precompute and cache the scaled ωs along with s. After the precomputation, the arithmetic cost of products in F[x]/(x^k − ω) is effectively reduced to that of products in F[x]/(x^k − 1). Note that this is different from the isomorphism F[x]/(x^k − ω) ≅ F[y]/(y^k − 1), which does not work for Kyber as ω has no k-th root. We call the multiplication strategy "asymmetric" because it requires the s-input in the expanded form (s, ωs), while the a-input remains in the usual form. Algorithm 16 is an illustration. Note again the arithmetical similarity between Algorithm 15 and Algorithm 16.
Lastly, we extend the idea of better accumulation for schoolbook multiplication in [CHK+21] to better accumulation for asymmetric multiplication: we keep the 64-bit results and reduce them to 32 bits only after computing all the corresponding asymmetric multiplications.

Interleaving for Multi-Layer Butterflies on Cortex-A72
This section describes our findings on how to implement radix-2 NTTs on the Cortex-A72 CPU. We follow the software optimization guide of the Cortex-A72 [ARM].
In-order frontend and out-of-order backend. In the in-order frontend of the pipeline, instructions are fetched and decoded into internal micro-operations (µops). After renaming the registers, µops are dispatched to the out-of-order backend. In the backend, there is one branch pipeline B, two integer pipelines I0 and I1, one integer multi-cycle pipeline M, two FP/ASIMD pipelines F0 and F1, one load pipeline L, and one store pipeline S. While up to three µops can be dispatched per cycle, there are limitations on the number of each type of µops that can be dispatched simultaneously. The following are the numbers for each type in a single cycle: one µop using B, up to two µops using I0/I1, up to two µops using M, one µop using F0, one µop using F1, and up to two µops using L/S. Furthermore, µops are dispatched in the oldest-to-youngest age order.
F0 and F1 for butterflies. We focus on the pipelines F0 and F1 for our vectorized implementation. The basic building blocks for butterflies are Algorithm 17 and Algorithm 18. The instructions mul, mls, and sqrdmulh can only go to F0, while sub and add go to F0 or F1. We interleave the instructions so that sub and add have a better chance to be dispatched to F1. We believe F0 is the bottleneck of the NTTs, and hence, reducing the load on F0 speeds up the computation. To facilitate the development of assembly implementations of radix-2 NTTs with interleaving, we split the computation of each layer into two qq_butterfly_top and two qq_butterfly_bot. qq_butterfly_top computes 4 Montgomery multiplications and qq_butterfly_bot computes 4 sub-add pairs. They are designed so that if we pass the same arguments to them, calling qq_butterfly_top followed by qq_butterfly_bot implements CT butterflies and reversing the order implements GS butterflies. If there are no dependencies, we can interleave qq_butterfly_bot with qq_butterfly_top, giving qq_butterfly_mixed for CT butterflies and qq_butterfly_mixed_rev for GS butterflies.
Multi-layer butterflies. A common approach for reducing the number of memory operations is to compute several layers of the NTT at a time with multi-layer butterflies. To avoid dependencies when interleaving instructions, we find that for radix-2 NTTs, computing coefficients distributed over 16 SIMD registers is the most beneficial approach. This suggests that 4 layers at a time are possible, since 2^4 = 16.

Results
We provide benchmarking results on the Arm Cortex-A72 processor and the Apple M1.
Arm Cortex-A72. The Arm Cortex-A72 CPU implements the Armv8-A architecture and has a triple-issue out-of-order pipeline. Specifically, we use the Raspberry Pi 4 Model B featuring the quad-core Broadcom BCM2711 chipset. It comes with a 32 kB L1 data cache, a 48 kB L1 instruction cache, and a 1 MB L2 cache, and runs at 1.5 GHz. For hashing, we use the SHA-3/SHAKE Neon implementation by Nguyen et al. [NG21]. For benchmarking individual functions, we make use of the cycle counter of the PMU. For benchmarking the full cryptographic schemes, we use SUPERCOP. We use gcc version 10.3.0 with -O3.

Apple M1. The Apple M1 system-on-chip is contained in Apple's 2020 MacBooks. It has four high-performance Firestorm cores and four energy-efficient Icestorm cores. The Apple M1 has special instructions for Keccak. We use the Keccak implementation by Westerbaan [Wes] which makes use of these instructions. For obtaining cycle counts, we make use of m1cycles.c from [NG21]. Due to the large and varying overhead of obtaining cycles, we cannot reasonably benchmark a single execution of a small function. Instead, we benchmark many iterations and report the average. We use clang version 12.0.5 with -O3.
Results for NTT and NTT^−1. Table 4 summarizes our results for the NTT for the level-three parameter sets. On the Cortex-A72, we outperform the kyber768 NTT and NTT^−1 by Nguyen et al. [NG21] by 19% each, even though we include a Barrett reduction which is not reported by Nguyen et al. [NG21]. Compared to the implementation by Sanal et al. [SKS+21], the speed-up is even more pronounced at 1.9× for the NTT and 2.4× for NTT^−1. For saber, our 32-bit NTT is 23% faster than the 16-bit NTT by Nguyen et al. [NG21]. However, the 16-bit NTT approach requires two NTTs followed by CRT and, consequently, our real speed-up is 3.1×. On the Apple M1, our speed-up compared to the kyber768 NTT and NTT^−1 by Nguyen et al. [NG21] is 1.6× each. The 16-bit saber NTTs are outperformed by our 32-bit NTTs by 44% and 27% on the M1. The actual speed-up when comparing the 32-bit NTT to two 16-bit NTTs followed by CRT is 4.3×. For Dilithium, we obtain a vast speed-up over the reference implementation.

Results for full schemes. Table 6 shows our results for the full cryptographic schemes. Kyber runs 9–14% faster on Cortex-A72 and 26–41% faster on Apple M1 than previous work [NG21]. For Saber, the speed-ups are even more significant with 24–35% fewer cycles on Cortex-A72 and 19–36% fewer cycles on Apple M1. Unsurprisingly, our Dilithium implementation performs much better than the reference implementation.
On Saber's SHA-3 performance. As we make use of a vectorized SHA-3 implementation computing two Keccak permutations at once, optimal performance can only be achieved if SHA-3/SHAKE calls can be parallelized. Unfortunately, the Saber specification mandates sampling vectors and matrices in a single call to SHAKE and can, thus, not benefit from this parallelization.

Table 6: Performance results for the full schemes kyber768, saber, and dilithium3 on Cortex-A72 and Apple M1. On Cortex-A72, we report the median cycle count of 10 000 executions for Kyber and Saber, and 100 000 executions for Dilithium. On Apple M1, we report the average cycle count of 10 000 executions for Kyber and Saber, and 1 000 000 executions for Dilithium. For Dilithium, we sign 59-byte messages.