Improved Plantard Arithmetic for Lattice-based Cryptography

. This paper presents an improved Plantard’s modular arithmetic (Plantard arithmetic) tailored for Lattice-Based Cryptography (LBC). Based on the improved Plantard arithmetic, we present faster implementations of two LBC schemes, Kyber and NTTRU, running on Cortex-M4. The intrinsic advantage of Plantard arithmetic is that one multiplication can be saved from the modular multiplication of a constant. However, the original Plantard arithmetic is not very practical in LBC schemes because of the limitation on the unsigned input range. In this paper, we improve the Plantard arithmetic and customize it for the existing LBC schemes with theoretical proof. The improved Plantard arithmetic not only inherits its aforementioned advantage but also accepts signed inputs, produces signed output, and enlarges its input range compared with the original design. Moreover, compared with the state-of-the-art Montgomery arithmetic, the improved Plantard arithmetic has a larger input range and smaller output range, which allows better lazy reduction strategies during the NTT/INTT implementation in current LBC schemes. All these merits make it possible to replace the Montgomery arithmetic with the improved Plantard arithmetic in LBC schemes on some platforms. After applying this novel method to Kyber and NTTRU schemes using 16-bit NTT on Cortex-M4 devices, we show that the proposed design outperforms the known fastest implementation that uses Montgomery and Barrett arithmetic. Speciﬁcally, compared with the state-of-the-art Kyber implementation, applying the improved Plantard arithmetic in Kyber results in a speedup of 25.02% and 18.56% for NTT and INTT, respectively. Compared with the reference implementation of NTTRU, our NTT and INTT achieve speedup by 83.21% and 78.64%, respectively. As for the LBC KEM schemes, we set new speed records for Kyber and NTTRU running on Cortex-M4.


Introduction
Quantum computers are being developed rapidly. Google [AAB + 19] reported their 53qubit processor in 2019. Two years later, a Chinese research team [WBC + 21] announced a 66-qubit processor named Zuchongzhi. As is known, Shor's algorithm can break the conventional public-key cryptosystems such as RSA and ECC on a sufficiently large quantum computer. Theoretically, a quantum computer with a few thousand qubits should break the 2048-bit RSA. However, there are still noise issues not being well understood. A recent research [GE21] shows that factoring 2048-bit integer in RSA within 8 hours needs a quantum computer with about 20 million noisy qubits. This implies the current progress of quantum computers is still far from breaking the currently deployed 2048-bit RSA or 256-bit ECC. On the other hand, IBM's roadmap of quantum computers [Hac20] claims that the number of qubits could be more than doubled per year, making a one million-qubit machine possible in 2030. The progress of experimental quantum computers and the surprising developments in the future motivated the cryptographic community to expedite the study of post-quantum cryptographic algorithms to find suitable alternatives to the traditional public-key cryptosystems within a decade or less.
In order to promote the development of post-quantum cryptography (PQC), NIST initiated a standardization project in 2016 to solicit, evaluate, and standardize the postquantum cryptographic algorithms. Until now, in the third round of the standardization process, seven finalists, including four key encapsulation mechanisms (KEM) and three digital signature algorithms, have been selected. Among these algorithms, three out of the four KEM finalists (i.e., Kyber [ABD + 20], Saber [BMD + 20], and NTRU [CDH + 20]) are from lattice-based cryptography (LBC), which indicates that LBC is competitive in terms of security and implementation efficiency. Evaluating the implementation efficiency of these candidates is an essential task [AASA + 20] not only in the third round of the NIST standardization project but also in the future commercial deployment.
Among all the operations in LBC, polynomial multiplication is one of the core operations. Therefore, a relatively low complexity method called Number Theoretic Transform (NTT) algorithm is widely used to reduce the computational complexity of polynomial multiplication. During the algorithm design of Kyber and NTTRU, the underlying ring was properly chosen so that NTT could be used. Modular multiplication is the dominant operation inside NTT, and the most commonly used modular multiplication algorithms are Montgomery multiplication [Mon85] and Barrett multiplication [Bar86], which were originally designed for large-moduli cryptosystems like RSA and ECC. However, the moduli of the NIST LBC candidates are relatively small, for example, 3329 for Kyber [ABD + 20], 12289 for Newhope [ADPS16], and 7681 for NTTRU [LS19]. Therefore, the product of two polynomial coefficients will extend to at most a word size (32 or 64 bits). Based on the word size characteristics of these moduli, Thomas Plantard proposed an efficient specialized word size modular multiplication (Plantard multiplication) [Pla21]. The proposed Plantard multiplication by a constant introduces an l × 2l-bit multiplication operation (l = 16/32 bit). Plantard suggested that if the l × 2l-bit multiplication can be implemented in one multiplication instruction in the target platform, the Plantard multiplication by a constant would consume one multiplication fewer than its Montgomery as well as Barrett counterparts and could be well deployed in LBC schemes.
Motivation. Plantard [Pla21] purely focused on the theoretical design yet leaves much room for further exploration. The original Plantard arithmetic cannot be efficiently applied to the existing LBC schemes for the following reasons. First, the original Plantard multiplication ([Pla21, Algorithm 8]) only presents a solution for multiplying unsigned integers mod an odd modulus q. Previous literature [Sei18, BKS19, ABCG20, GKS21, AHKS22] has shown that using unsigned integers in the NTT implementation of LBC schemes is inefficient because it would introduce an extra addition into each butterfly unit. Second, the input range of the original Plantard multiplication is mere [0, q], which requires expensive modular reductions on the polynomial coefficients after each NTT layer.
Previous reports have offered solutions to these problems by adapting the Montgomery arithmetic. Alkim et al. [ADPS16] showed that the large input range of Montgomery reduction enables the so-called lazy reduction technique to reduce the use of expensive modular reduction. Seiler et al. [Sei18] proposed a signed version of Montgomery reduction on the Intel platform, which further speeds up the butterfly units. The follow-up work [BKS19, ABCG20, GKS21] further extended the signed Montgomery reduction to the Cortex-M4 platform. The fastest Montgomery reduction for 16-/32-bit moduli [ABCG20, GKS21,AHKS22] only consumes 2 cycles on the Cortex-M4 platform. Even though Montgomery arithmetic could provide an efficient solution to most of the LBC schemes, this paper shows that Plantard arithmetic could be a potential alternative to Montgomery arithmetic in the LBC schemes with l-bit odd modulus, given that the target platform could implement the l × 2l-bit multiplication in one multiplication.
Our Contributions. This paper aims to improve, customize, and implement the Plantard arithmetic for the existing LBC schemes. Our contributions are threefold: • First, we present an improved Plantard arithmetic for signed integers with its correctness proof. The proposed method is inspired by the observation that the moduli in the existing LBC schemes are normally several times smaller than the modulus constraint in the original Plantard arithmetic.
• Second, we detail the advantages of the improved Plantard arithmetic over its original version, Montgomery and Barrett arithmetic. With some tweaks over the original algorithm, the improved Plantard arithmetic not only supports signed integers but also extends its input range and narrows down its output range, thus enabling better lazy reduction strategies in NTT/INTT. Besides, it inherits the advantage of the Plantard multiplication, i.e., saving one multiplication when one of the two operands is a constant. All these merits make it possible to replace the Montgomery and Barrett arithmetic with the improved Plantard arithmetic in LBC schemes on some platforms.
• Third, we present an efficient implementation of the improved Plantard arithmetic for 16-bit odd moduli using the smulw{b,t} instruction to perform the 16 × 32-bit multiplication on Cortex-M4. By replacing the state-of-the-art Montgomery and Barrett arithmetic with the improved Plantard arithmetic, we demonstrate that the proposed design enables more efficient polynomial arithmetic, namely NTT, INTT, and modular reduction, resulting in faster Kyber and NTTRU implementations on Cortex-M4. To the best of our knowledge, this is the first assembly implementation of NTTRU on Cortex-M4, and we set new speed records for Kyber and NTTRU.
It should be noted that our optimizations for Kyber and NTTRU are not limited to Cortex-M4 and can be extended to Cortex-M7 as well as some 32-bit microcontrollers without SIMD extensions, e.g., the SiFive Freedom E310 with a 32-bit E31 RISC-V core [SiF] (see Subsection 4.3 for more details about the extensibility of our optimizations).

Preliminaries
In this section, we first briefly introduce the variants of the Learning With Errors (LWE) problem. Then, the algorithms of Kyber and NTTRU are described. Finally, the details of the NTT and related modular arithmetic are given.
The Learning With Errors (LWE) problem is systematically defined by Regev in [Reg05] for constructing public-key cryptosystems under the quantum random oracle model. Lyubashevsky et al. [LPR10] introduced a Ring algebraic structure for LWE, in effect solving the inefficient problem of LWE-based schemes; this variant is called Ring-LWE. More recently, Langlois and Stehlé [LS15] proposed the Module-LWE problem, which bridges LWE and Ring-LWE. One of the NIST PQC KEM finalists, Kyber, is constructed based on Module-LWE. The original NTRU, which was proposed in 1996 and first published in 1998 [HPS98] by Hoffstein, Pipher, and Silverman, has gone through a long research history. As one of the KEM finalists, NTRU has been significantly improved in terms of security and efficiency compared with the original design.

Kyber
The security of Kyber is based on the Module-LWE problem, whose formal definition is given as follows: (1) Here, the secret vector s is sampled from a centered binomial distribution B η (R k×1 q ), where k is the dimension of the underlying Module-LWE problem. The public matrix A is sampled from a uniform random distribution U(R k×k q ). The decisional Module-LWE problem indicates that (A, b) is computationally indistinguishable from a uniform random sampling.
The polynomial ring used within Kyber is R q = Z q [X]/(X n + 1) with q = 3329 and n = 256 across all parameter sets. Each polynomial coefficient in R q can be accommodated in a 16-bit integer, which enables a 16-bit NTT polynomial multiplication. Kyber constructs a CCA-secure KEM from a CPA-secure public key encryption (PKE). Kyber PKE consists of key generation (Algorithm 1), encryption (Algorithm 2), and decryption (Algorithm 3). We refer readers to Kyber's specification [ABD + 20] for more details about the CCA-secure KEM. The core operation of Kyber in key generation and encryption is the matrix-vector multiplication, i.e.,Â • NTT(s) andÂ T •t, where each element of the matrix and vector is a polynomial.

NTRU and NTTRU
NTRU, as a finalist in the third round of the NIST PQC project, is the combination of two first-round candidates, NTRUEncrypt [ZCHW] and NTRU-HRSS-KEM [SHRS]. The KEM specified in the NTRU's specification is constructed using a generic transformation of a correct deterministic public-key encryption scheme (correct DPKE)[CDH + 20]. The polynomial arithmetic of NTRU operates in three polynomial rings Z 3 [x]/Φ n , Z q [x]/Φ n , and Z q [x]/ (Φ 1 · Φ n ) with Φ 1 = (x − 1) and Φ n = x n−1 + x n−2 + · · · + 1 . NTTRU [LS19], an NTT-friendly variant of NTRU, is constructed using a partially correct probabilistic public-key encryption scheme (partially correct PPKE). The speed of NTTRU is faster than other NTRU variants, including the third-round finalist NTRU and alternate candidate NTRU Prime. However, the authors only provided security proof

NTT
In this paper, NTT is classified into 16-bit NTT [LS19, ABD + 20] and 32-bit NTT [CHK + 21, ZHLR22] according to the modulus size. For example, the modulus size of Kyber and NTTRU are both smaller than 2 16 , so the NTT of these moduli is classified into the 16-bit NTT. The modulus of Dilithium is q = 8380417, which is larger than 16-bit and smaller than 23-bit; thus, the NTT of modulus q = 8380417 is classified into the 32-bit NTT. We mainly focus on the 16-bit NTT in this paper. NTT multiplication over the polynomial ring Z q [X]/f (X) is based on the fact that the where f i (X) are small-degree (degree-0 for a complete-NTT, higher degree for an incomplete-NTT [CHK + 21]) polynomials. The polynomial multiplication of a, b ∈ Z q [X]/f (X) is first calculated using NTT as for i = 0, · · · , n − 1. Then we compute the base multiplication as a 0 b 0 , · · · , a n−1 b n−1 . Finally, using the inverse NTT (INTT), we find the result polynomial c such that c = ab mod f (X). For Kyber, the polynomial f (X) = X 256 + 1 can be factored, with the help of a primitive 256-th root of unity ζ, as This factorization is incomplete since f (X) is factored as degree-1 polynomials instead of degree-0. Therefore, the NTT of Kyber is a 7-layer incomplete-NTT.
For NTTRU, the polynomial f (X) = X 768 − X 384 + 1 is initially split into (X 384 + 684)(X 384 − 685), then all the way down to irreducible polynomials X 3 ± r. The NTT of NTTRU consists of 8 layers of butterflies, and we refer readers to [LS19] for more details.
The commonly used computing structures of NTT are Cooley-Tukey (CT) butterfly [CT65] and Gentleman-Sande (GS) butterfly [GS66]. The CT butterfly accepts inputs in normal order and generates outputs in bit-reverse order. In contrast, the GS butterfly takes inputs in bit-reverse order and produces outputs in normal order. Therefore, a hybrid structure which uses CT butterfly for NTT and GS butterfly for INTT will accept inputs and generate outputs both in normal order without any bit-reversal operations.

Modular Arithmetic
In this paper, we divide the modular reduction into two types: mod and mod ± , where c mod q obtains a result in [0, q) while c mod ± q gets a signed result in (− q 2 , q 2 ). The "modular reduction" in this paper refers to both types of modular reduction. The modular multiplication c = a · b mod q is normally divided into two parts: the multiplication c = a · b and the modular reduction c mod q. The modular reduction c mod q can be done by c− c/q q, where the division operation is non-constant-time in modern computers. The nonconstant-time characteristic of division prevents it from being used in the implementation of cryptographic algorithms due to security concerns. The commonly used constant time modular reduction algorithms are Montgomery reduction [Mon85] and Barrett reduction [Bar86]. Their basic idea is to replace the non-constant-time division with constant-time multiplications and shifts.
In current LBC schemes, the known fastest modular arithmetic is the signed version of Montgomery multiplication [Sei18], which is given in Algorithm 7. The constant β is often set to word size 2 l so that the modulus q fits in one word. For instance, β = 2 16 when q < 2 16 and β = 2 32 when 2 16 ≤ q < 2 32 . The algorithm replaces the division by multiplications and shifts operations. It requires 3 multiplications. Bar86] approximately computes c/q using (c · λ)/2 2l +γ , where λ is a precomputed constant (λ = 2 2l +γ /q ) and l satisfies 2 l −1 < q < 2 l . Note that γ is an arbitrary integer used to control the accuracy of the approximation (the larger the value, the higher the accuracy). The algorithm of Barrett multiplication is given in Algorithm 8, and it also requires 3 multiplications. Barrett algorithm is different from Montgomery algorithm. Its implementation generally consists of a shift operation for a non-word-size offset (2l + γ), which would require an explicit shift instruction during the implementation on Cortex-M4. We can get the signed version with minor modifications 1 .

Algorithm 8 Barrett multiplication [Bar86]
Input: Operand a, b such that 0 ≤ ab < 2 2l +γ , the modulus q satisfying 2 l −1 < q < 2 l , and the precomputed constant λ = 2 2l +γ /q proposed a new modular multiplication (Plantard multiplication) tailored for the word size moduli and claimed that the proposed algorithm outperforms other existing solutions, including Montgomery and Barrett multiplication, in some application scenarios. The original Plantard multiplication is given in Algorithm 9. Here, we denote X mod 2 l as [X] l , X >> l as [X] l , where l is a positive integer. The Plantard multiplication also needs 3 multiplications, which is the same as Montgomery and Barrett multiplication. But the key advantage of Plantard multiplication is that one could precompute the product of b and q −1 mod 2 2l when operand b is a constant. This introduces an l × 2l-bit multiplication a · (bq ). If the target platform can compute the l × 2l-bit multiplication in one multiplication instruction, then one multiplication could be saved for the overall modular multiplication. This advantage mainly comes from the fact that the product of two operands c = a · b is only used once in Plantard multiplication, while its Montgomery and Barrett counterparts use it twice (see step 2,4 in Algorithm 7 and step 2,3 in Algorithm 8).

Algorithm 9 Original Plantard Multiplication [Pla21]
Despite this advantage, it should be noted that the original algorithm is designed for unsigned integer inputs within the range [0, q]. It has been well-illustrated that using a signed version of modular arithmetic in NTT/INTT will further speed up the computation. Therefore, the original Plantard multiplication could be extended to support signed arithmetic and purchase faster NTT/INTT implementations.

Target Platform: Cortex-M4
Our target platform is STM32F407G-DISC1, with a 32-bit ARM Cortex-M4 processor implementing the ARMv7E-M instruction sets. It provides 16 32-bit general-purpose registers, but only 14 of them are available for programming. Because Cortex-M4 supports floating-point (FP) computation, 32 32-bit FP registers are available for programming and can be used to "cache" frequently-used values.
The Cortex-M4 also supports Single Instruction Multiple Data (SIMD) instructions that can operate two halfwords or four bytes in parallel. The essential instructions in our implementation include smul{b,t}{b,t}, smulw{b,t}, and ldrd. We use Rd to represent the destination register; Rm and Rn represent two source registers. The smulbb instruction multiplies the bottom halfword of Rm by the bottom halfword of Rn and writes the result to Rd. The smulwb instruction multiplies Rm by the bottom halfword of Rn; extracts the most significant 32 bits of the result, and writes it to Rd. The ldrd R8, R9, [R3, #0x20] instruction loads R8 from the address R3+32 and loads R9 from the address R3+36.

Improved Plantard Arithmetic
This section presents the improved Plantard arithmetic, which was initially proposed in [Pla21]. Our improvement is built by first imposing a slight restriction on the modulus based on the observation of current LBC schemes (e.g., Kyber and NTTRU); then, several tweaks are proposed to improve the Plantard arithmetic. Detailed correctness proof and advantages of the improved algorithm will be presented.

Improved Plantard Multiplication
The improved Plantard multiplication is presented in Algorithm 10. Due to the adaptation of signed numbers, we denote (X mod ± 2 l ) as [X] l , (X >> l ) as [X] l for a positive integer l . Here, (X >> l ) is equivalent to X/2 l or X/2 l for a positive or negative number, respectively. For simplicity, we use X/2 l to represent (X >> l ) for sign-unknown number X unless a negative number is explicitly specified. We denote q as an odd modulus, Algorithm 10 Improved Plantard multiplication l + 2 α q l 2: return r l as the minimum word size (e.g., 16, 32 or 64 bits) such that q < 2 l−α−1 , where α is a small integer. The improved Plantard multiplication is based on the observation that the moduli adopted by most of the existing LBC schemes are several times smaller than the restriction q < 2 l φ in the original algorithm [Pla21], e.g., 12-bit modulus 3329 in Kyber, 13-bit modulus 7681 in NTTRU, 23-bit modulus 8380417 in Dilithium. Thus, there are still margins between the original constraint and the moduli adopted in LBC schemes. Besides, all of the current state-of-the-art modular arithmetic is implemented with signed integers, which can eliminate extra additions of the multiple of q to ensure a positive output of the butterfly unit. Considering that the Plantard multiplication proposed by Thomas Plantard has the advantage of multiplying a constant (i.e., one multiplication is saved) but only supports unsigned integers, we believe that adapting signed integers for this algorithm will lead to more efficient modular arithmetic in LBC schemes.
We start by first introducing a small integer α ≥ 0, which narrows down the range of the modulus q from q < 2 l φ to q < 2 l−α−1 < 2 l−α φ . Based on this tweak, we can then replace the original main step r In order to adapt signed inputs for this algorithm, we modify the inequality that maintains the correctness of this algorithm from 0 < q2 l − p 0 q + ab < 2 2l to 0 < q2 l+α − p 0 q + ab < 2 2l under certain circumstances (see Subsection 3.2 for more details). We will show the correctness proof and advantages of the improved algorithm in the following two parts.

Proof of the Improved Plantard Multiplication
Theorem 1 states the correctness of Algorithm 10, and we will give the specific proof below.
Proof of Theorem 1. The main step of Algorithm 10 is to compute Since q 2 l−α < 1 2 , we can get r > − q 2 from the left-hand side of the inequation. Let's consider q 2 + (2 α −1)q 2 l on the right-hand side; since q < 2 l−α−1 , we have Because q is an odd number, then Therefore, the result r lies in (− q 2 , q 2 ).
2. Then, we check that r = ab(−2 −2l ) mod ± q. The proof is very similar to the one in [Pla21], except that we introduce a small integer α ≥ 0 and impose a stricter constraint q < 2 l−α−1 on q. Since q is an odd number, there exists a p = abq −1 mod ± 2 2l so that , then instead of analyzing q2 l − p 0 q + ab in the original work, we slightly modify the equation to q2 l+α − p 0 q + ab. The tweak here is the key to the improved Plantard arithmetic, which allows us to accept both positive and negative inputs and extend the range of the inputs a, b to [−q2 α , q2 α ]. We will further prove that this modification will not change its correctness under the constraints in Theorem 1.
The correctness of the Plantard multiplication proposed by Thomas Plantard is based on the inequality: 0 < q2 l − p 0 q + ab < 2 2l . We now check that our modified equation q2 l+α − p 0 q + ab also satisfies this inequality: (1) When ab > 0 and p 0 < 0, we have Therefore, for a small integer α ≥ 0, we have q2 l+α − p 0 q + ab < 2 2l . The proof of the right-hand side of Equation 3 ends.
Overall, we obtain that Combining with the fact that pq − ab is divisible by 2 2l , we have the following equation: In summary, the improved Plantard multiplication is achieved by first imposing a stricter constraint on the modulus q. Then, the inequality that guarantees the correctness is modified into 0 < q2 l+α − p 0 q + ab < 2 2l . Note that the improved Plantard multiplication can be easily modified into a word size modular reduction (Plantard reduction).

Comparisons
Original Plantard multiplication. Although the main steps between the improved and original Plantard multiplications have only a minor difference, our tweaks make it more efficient and practical in LBC schemes thanks to the following merits.
• The improved algorithm supports signed inputs and produces a result in (− q 2 , q 2 ), while the original version only accepts unsigned integers in [0, q]. The advantage of using signed integers in LBC schemes has been fully discussed in previous work, i.e., eliminating an extra addition during the butterfly computation. Montgomery and Barrett arithmetic. The improved modular arithmetic also has several merits over the state-of-the-art Montgomery and Barrett arithmetic in LBC schemes.
• First of all, as introduced in [Pla21], the intrinsic advantage of the Plantard multiplication is that it costs one multiplication fewer than Montgomery and Barrett multiplication when one of the two inputs is fixed. This is achieved by precomputing the product of q and the constant input modulo 2 2l . Moreover, the Barrett arithmetic may require an explicit shift operation for a non-word-size offset, which will further decrease its efficiency. We refer readers to Subsubsection 4.2.4 for a detailed comparison with the Barrett arithmetic.
• Secondly, the improved Plantard multiplication extends the input range to [−q2 α , q2 α ]. We can easily modify this algorithm into the Plantard reduction that accepts input in [−q 2 2 2α , q 2 2 2α ] (see Algorithm 13 for details). Compared with the Montgomery reduction that accepts [−2 l−1 q, 2 l−1 q], the improved Plantard reduction supports more redundancy if α is big enough. Therefore, we recommend choosing the largest α that satisfies the prerequisites so that the improved algorithm has the largest input range. The input range of the proposed modular reduction is hard to compare with the Barrett reduction directly because the input range of the Barrett arithmetic depends on the parameter setting, i.e., q and γ in Algorithm 8. The Barrett reduction used inside NTT/INTT in Kyber and NTTRU only needs to reduce 16-bit signed integers. Our modular reduction (see Algorithm 15) can also cover all 16-bit signed integers and can be a perfect replacement for the Barrett reduction in the NTT/INTT of these LBC schemes.
• The output range of the improved algorithm is narrowed down to (− q 2 , q 2 ), which is the same as the state-of-the-art Barrett  With all these merits over the Montgomery and Barrett arithmetic, the improved Plantard arithmetic has two weak spots. If we precompute the product of q and the twiddle factors modulo 2 2l in NTT, the size of the precomputed intermediate result will be doubled. First, it needs to deal with the l × 2l-bit multiplication, which may not be applicable to architectures like AVX2 and NEON. However, we show that this method is perfectly suitable for Cortex-M4, Cortex-M7 and some 32-bit microcontrollers (see Subsection 4.3). Second, the double-size twiddle factors are normally placed into the Read-Only Memory (ROM) instead of the Random Access Memory (RAM). Consequently, the side effect is that we need more cycles to load them into registers (see Subsubsection 4.2.2 for detailed discussion); this will not affect the stack usage. Nevertheless, arithmetic and experimental analyses show that the overall cycle reduction obtained from the aforementioned merits could offset the cycle increment introduced by twiddle factor loading.

Optimized Implementation on Cortex-M4
In this section, we present an efficient implementation of the improved Plantard arithmetic for 16-bit word-size moduli on Cortex-M4 and apply it to Kyber and NTTRU. Then, the extensibility of the improved Plantard arithmetic on other platforms and schemes is discussed. Note that the Plantard arithmetic mentioned below refers to the improved Plantard arithmetic presented in Section 3.

Efficient Plantard Arithmetic for 16-bit Modulus
Before moving forward into the implementation details, it is assumed that there exists a small integer α ≥ 0 such that the prerequisite (q < 2 l−α−1 ) mentioned in Theorem 1 is established (which is the case for most of the LBC schemes, including Kyber and NTTRU). The word size in the following parts is set to l = 16 to support the 16-bit arithmetic in Kyber and NTTRU. As recommended in Subsection 3.3, the maximum α that satisfies Theorem 1 is α = 3/α = 2 for Kyber/NTTRU.
The l + q2 α ] l with the smlabb instruction. Here, the term q2 α is fixed and can be precomputed without extra runtime overhead. The final result locates in the top half of r. Compared with Algorithm 12, the state-of-the-art 16-bit Montgomery multiplication implementation on Cortex-M4 [ABCG20], we reduce one multiplication by precomputing the product of q −1 and the constant operand b. As for its comparison with the Barrett arithmetic, we refer to Subsubsection 4.2.4 for more details.
Note that [LS19] proposed a similar trick for Montgomery multiplication with the AVX2 instructions, which is improved over the work in [Sei18]. It is achieved by precomputing q −1 /−q −1 with the bottom half of c (see step 2 of Algorithm 7/Algorithm 12). Unfortunately, it is not suitable for Cortex-M4 because the trick shown in [LS19] highly relies on the two-instruction-fashion multiplication, whereas the 16/32-bit multiplication on Cortex-M4 is carried out in a single instruction. Applying their optimization on Cortex-M4 introduces extra multiplication instructions as stated in [BKS19, Subsection 3.2].
As for the multiplication of two variables, we present a 2-cycle Plantard reduction over a 32-bit signed product c ∈ [−q 2 2 2α , q 2 2 2α ], which is shown in Algorithm 13. Instead of using smulwb as we did in Algorithm 11, the mul instruction is used to compute [[cq ] 2l ] l , and the intermediate result locates in the top half of r. The result is then obtained by the smlatb instruction. It is worth noting that the input range of Algorithm 13 is q2 2α−l+1 times larger than the Montgomery reduction (the reduction version of Algorithm 12).

Efficient 16-bit NTT/INTT Implementation
This subsection details the improvements the improved Plantard arithmetic brings to the butterfly unit, layer merging, and lazy reduction in NTT/INTT.

Butterfly Unit
The core operations in NTT/INTT are performed in the butterfly unit. Commonly, there are two types of butterfly structures, namely the CT butterfly and GS butterfly. In Kyber, using CT butterfly in NTT and GS butterfly in INTT is a common strategy to avoid coefficients flipping. However, recent reports [CHK + 21, AHKS22] state that using CT butterfly for both NTT and INTT in LBC schemes would also result in faster code. The key operation in the butterfly unit is the modular multiplication by a proper twiddle factor ζ, which is a precomputed constant. It is common to store the twiddle factors in the Montgomery domain when using Montgomery arithmetic. Similar to Montgomery arithmetic, Plantard arithmetic also produces a result in a special domain, namely c(−2 −2l ) mod ± q. Therefore, to integrate Plantard arithmetic into the butterfly unit, we also store the twiddle factors in the "Plantard" domain by multiplying each twiddle factor with a constant −2 2l mod q. Besides, to utilize the improved Plantard multiplication by a constant (Algorithm 11) in the butterfly unit, one also needs to multiply q −1 mod ± 2 2l by the twiddle factor ζ in the "Plantard" domain modulo 2 2l , namely ζ = (ζ · (−2 2l ) mod q) · q −1 mod ± 2 2l . Note that this precomputation would result in a 2l-bit twiddle factor. Although it would introduce extra cycles for data loading, this overhead can be offset by the improvements brought by the improved Plantard arithmetic.
Algorithm 14 illustrates the detailed instruction sequence of the double CT butterfly, which computes a = (a top +b top ζ)||(a bottom +b bottom ζ) packed arguments a, b. Thanks to the SIMD instruction smulw{b,t}, one can pack two 16-bit coefficients into one 32-bit register b. Then, one can perform the Plantard multiplication (Algorithm 11) by a proper twiddle factor ζ over the top and bottom half of b using the smulwt and smulwb instructions, respectively. The follow-up instruction sequence is the same as the previous work [ABCG20]. In summary, we obtain a 7-instruction double CT butterfly for packed arguments, which reduces 2 instructions compared with the one that uses Montgomery multiplication. Similarly, we can also obtain a 7-instruction double GS butterfly for INTT.

Layer Merging
As stated in Subsubsection 4.2.1, using the improved Plantard arithmetic will double the size of the twiddle factors. It would bring extra load instructions to load the 32-bit twiddle factors compared with the original 16-bit version. However, the improvements brought by the improved Plantard arithmetic can easily bury this overhead thanks to the layer merging techniques. There are two efficient layer merging strategies for Kyber on Cortex-M4, which are presented in [ABCG20] and [AHKS22], respectively. The layer merging strategy in [ABCG20] is the 3-layer merging strategy (3-3-1), while [AHKS22] adopts the 4-layer merging strategy (4-3). Since the first 4-layer NTT re-uses the same 15 twiddle factors multiple times, [AHKS22] proposes to cache the 15 16-bit twiddle factors into 8 FP registers and replace the memory access instruction with the cheaper vmov instruction. Apart from using the FP registers and the vmov instruction, the 4-layer merging strategy is actually built upon the 3-layer merging strategy in [ABCG20].
The 3-layer merging strategy in [ABCG20] allows us to compute 8 butterflies at the cost of loading 1, 2, or 4 32-bit twiddle factor(s) in each layer. Note that loading 2 consecutive 32-bit twiddle factors can be achieved by a 3-cycle ldrd instruction on Cortex-M4, which is merely 1-cycle slower than the original ldr instruction to load 2 consecutive 16-bit twiddle factors. We rearrange the twiddle factors in the same order as they are used so that one can use ldrd to load 2 consecutive 32-bit twiddle factors. Overall, by integrating the Plantard arithmetic into the 3-layer merging strategy, one could reduce 8 cycles (reduce 1 cycle for each butterfly) with 0, 1, or 2 extra cycle(s) to load the 32-bit twiddle factors in each layer. For the 8 butterflies that only require 1 twiddle factor, one could reduce 8 cycles at no extra cost since ldr has the same cost as ldrh.
As for the 4-layer merging strategy, due to the double size of the twiddle factors, one needs to load 15 twiddle factors into 15 FP registers with the vldm instruction. Compared with 8 FP registers in the original design, our design consumes 7 extra cycles. Besides, during each iteration of the first 4 layers, one needs 7 extra vmov instructions to retrieve the twiddle factors from the FP registers to general registers. This extra cycle consumption can be totally covered by the cycle reduction brought by the improved Plantard arithmetic.
In order to reveal the actual effect brought by the improved modular arithmetic in our implementation, both 3-layer and 4-layer merging strategies are used in Kyber, while the better 4-layer merging strategy is adopted in NTTRU. Besides, for the implementation of NTTRU and Kyber with the 3-layer merging strategy, we use CT butterflies in NTT and GS butterflies in INTT. As for the Kyber implementation with the 4-layer merging strategy, the CT butterflies are adopted in both NTT and INTT, similar to [AHKS22].

Lazy Reduction
To better illustrate the improvements the improved Plantard arithmetic brings to both CT and GS butterflies, we only discuss the lazy reduction strategies for the NTT with CT butterflies and INTT with GS butterflies here. When Montgomery arithmetic is applied to NTT and INTT, all of the coefficients will grow by q after each layer of NTT, while half of the coefficients will double after each layer of INTT. The well-known lazy reduction technique can minimize the number of modular reductions in NTT/INTT, which is mainly determined by the input range of the modular multiplication. Compared with the Montgomery arithmetic, the output range of the improved Plantard arithmetic is halved, while the input range of the improved Plantard reduction (Algorithm 13) is about 2 α times larger than Montgomery reduction (i.e., the reduction version Algorithm 12). Therefore, applying the improved Plantard arithmetic in NTT/INTT can halve/decrease the growing rate of coefficients compared with the one that uses Montgomery arithmetic, thus enabling better lazy reduction strategies. CT Butterfly. Specifically, for the modulus q = 3329 (9q < 2 l−1 < 10q) in Kyber, the forward NTT has 7 layers of butterflies. Therefore, the 7 layers of butterflies that use Montgomery multiplication will expand all of the 256 coefficients by 7q. Therefore, at least one modular reduction is required to reduce the 256 coefficients to perform the follow-up base multiplication. On the contrary, when applying Plantard multiplication, the coefficients only grow by 3.5q in 7 layers. Since the inputs of NTT are always smaller than q, 4.5q lies in the modular multiplication input range [−q2 α , q2 α ] of Kyber (α=3 for Kyber). Therefore, the output coefficients of NTT with Plantard multiplication can be directly used in the base multiplication without modular reduction.
It should be noted that the coefficient expansion would have a bigger effect on schemes with bigger modulus like NewHope [ADPS16] and NTTRU [LS19]. As for the modulus q = 7681 (4q < 2 l−1 < 5q) in NTTRU, all of the 768 coefficients need one modular reduction after every three layers of butterflies when Montgomery multiplication is used. Since the forward NTT in NTTRU has 8 layers of butterflies, we need two modular reductions after the third and sixth layers of butterflies. The last two layers of butterflies will produce coefficients smaller than 3q. These coefficients cannot be directly used in Montgomery multiplication because (3q) 2 = 9q 2 , which is out of the input range of q2 l−1 . Therefore, the forward NTT implemented with Montgomery arithmetic in NTTRU needs three modular reductions for 768 coefficients. However, when using the improved Plantard multiplication in NTTRU, we only need one modular reduction for 768 coefficients after the seventh-layer butterflies. The final layer of butterflies only expand the coefficients up to 1q, which lies in the input range of [−q2 α , q2 α ] (α = 2 for NTTRU), and these coefficients can be directly used in the base multiplication. In sum, two modular reductions for 768 coefficients are saved compared with the implementation that uses Montgomery arithmetic, which fully illustrates the benefits of a larger input range and smaller output range of our method. GS Butterfly. As for the INTT with GS butterflies in Kyber, the advantages of applying the Plantard arithmetic are twofold. First, the maximum input value of INTT is halved after the matrix-vector multiplication (i.e., Step 6 of Algorithm 2). If one uses Montgomery arithmetic in the matrix-vector multiplication, the coefficients produced in this process would be smaller than 2q, 3q, and 4q for Kyber512, Kyber768, and Kyber1024, respectively. Therefore, one modular reduction of coefficients is required after the first and second layers of butterflies in INTT for Kyber768/Kyber1024 and Kyber512, respectively. In our case, the modular reduction will have a one-layer delay since the matrix-vector multiplication generates coefficients smaller than 1q, 1.5q, and 2q. Second, since the maximum value of the reduced coefficient is 0.5q after each modular reduction, 4 layers of butterflies can be conducted over the reduced coefficients instead of 3 when using Montgomery's method. After the second-layer butterflies of Kyber768/Kyber1024, the modular reduction is applied Algorithm 15 Double Plantard reduction for packed coefficients Input: A 32-bit packed integers a = a top ||a bottom where a top , a bottom are two 16-bit signed coefficients Output: r = (a top mod ± q)||(a bottom mod ± q), −q/2 < r top , r bottom < q/2 1: const ← (−2 2l mod q) · (q −1 mod ± 2 2l ) mod ± 2 2l precomputed 2: smulwb t, const, a 3: smulwt a, const, a 4: smlabt t, t, q, q2 α 5: smlabt a, a, q, q2 α 6: pkhtb r, a, t, asr#16 7: return r for all of the 128 coefficients; then all coefficients have a maximum value of 0.5q. The third-layer butterflies have a step size of 8, and the first 8 coefficients (a 0 ∼ a 7 ) will first grow to 8q after the sixth-layer butterflies. The final-layer butterflies have a step size of 128. Two sets of 8 coefficients (a 0 ∼ a 7 , a 128 ∼ a 135 ) have a maximum value of 8q, while the rest of the coefficients are smaller than 4q. Therefore, we only need to perform modular reduction on these 16 coefficients instead of all of the 128 coefficients in previous reports, and other coefficients will be naturally reduced by the modular multiplication with the twiddle factors and 128 −1 in the final-layer butterflies.
The GS butterflies are also adopted for the INTT in NTTRU. The optimization brought by our method is similar to the INTT in Kyber, except that the modular reduction for half of the coefficients is required after every 3 layers of butterflies with Plantard arithmetic instead of 2 when using Montgomery arithmetic. Thus, only 2 modular reductions for 384 coefficients are required after the third and sixth layers of butterflies.

Double Plantard Reduction for Packed Coefficients
The modular reduction of coefficients is necessary if the next operation in NTT/INTT would cause an overflow. Previous work [ABCG20] uses double Montgomery reduction or double Barrett reduction (i.e., Algorithm 10 and Algorithm 8 in [ABCG20]) to reduce two packed coefficients on Cortex-M4, which consume 7 cycles and 8 cycles, respectively. The double Barrett reduction in [ABCG20, Algorithm 8] requires two explicit shift operations for packed coefficients. Recent work further reduces its cycle counts down to 6 cycles by eliminating two explicit shift operations using the smlawb and smlawt instructions (see [AHKS22,Algorithm 3.2]). In this work, an even better 5-cycle double Plantard reduction is proposed and shown in Algorithm 15. This algorithm is based on the fact that the Plantard reduction over a 16-bit signed integer can be viewed as a Plantard multiplication by a constant −2 2l mod q. And one can multiply q −1 by the constant −2 2l mod q to obtain a precomputed constant, thus saving one multiplication by q −1 . Note that the input range of Algorithm 15 is the same as Algorithm 11. Since Algorithm 15 generates signed output in (− q 2 , q 2 ), this algorithm is only applicable inside NTT/INTT. The Barrett reduction is still required outside NTT/INTT to obtain a positive result.

Extensibility on Other Platforms and Schemes
The efficiency of the Plantard arithmetic for 16-bit moduli relies on the smulw{b,t} instruction to perform the 16 × 32-bit multiplication on Cortex-M4. The 16 × 32-bit multiplication would make it difficult to apply the Plantard arithmetic to architectures like AVX2 and NEON. However, the proposed optimizations for 16-bit NTT are not limited to Cortex-M4 but can also be extended to Cortex-M7 as both of the devices share the same SIMD extensions [Lor16]. Moreover, the Plantard arithmetic for 16-bit moduli may Algorithm 16 The improved Plantard multiplication by a constant on RISC-V Input: A 32-bit signed integer a ∈ [−q2 2α , q2 2α ], a precomputed 2l-bit integer bq where b is a constant, q = q −1 mod ± 2 2l Output:  [SiF], equipped with a 32-bit E31 RISC-V core, as an example. The instruction sequence of Algorithm 11 on RISC-V is shown in Algorithm 16. Here, the 16 × 32-bit multiplication is directly implemented with the 32 × 32-bit multiplication instruction, and the result locates in the top half of r. One extra shift operation is required to shift it down to the bottom half for the follow-up operations. Compared with its Montgomery and Barrett counterparts [Gre20, Listing 5.2.10] on RISC-V, the improved Plantard multiplication by a constant reduces one multiplication instruction (the intrinsic advantage of the Plantard arithmetic) and introduces an extra shift instruction. On the SiFive Freedom E310 microcontroller, one multiplication instruction costs 5 cycles while one shift instruction costs 1 cycle [SiF]. Therefore, the improved Plantard arithmetic is better than its Montgomery and Barrett counterparts on this platform. Similarly, for other 32-bit microcontrollers without SIMD instruction, our method for 16-bit moduli would still outperform its Montgomery and Barrett counterparts if multiplication instruction is slower than the shift instruction.
Similar analysis results can be obtained on the Plantard arithmetic for 32-bit moduli. The improved Plantard arithmetic would still outperform its Montgomery and Barrett counterparts if the target platform provides native 32 × 64-bit or 64 × 64-bit multiplication instructions, and the multiplication instruction is slower than the shift instruction. In this work, we only focus on accelerating Kyber and NTTRU with 16-bit NTT on Cortex-M4. We leave the exploration of Plantard arithmetic's practical applications on various schemes and platforms as future work.

Benchmarking Setup
The benchmarking results are obtained on STM32F407G-DISC1 with the STM32F407VGT6 MCU. The board is equipped with 1MiB of flash memory and 192 KiB of RAM. The benchmarking setup can refer to pqm4 [KRSS], and the microcontroller is clocked at 24 MHz to avoid wait states during memory operations as recommended in pqm4. The compile tool is arm-non-eabi-gcc version 10.2.1, and we compile our code with -flto and -O3 options, which are the same as [ABCG20] and [AHKS22]. The Keccak implementation is token from pqm4, and the hardware random number generator (RNG) of the microcontroller is used in our implementation.
The proposed Kyber implementation is initially built upon the pqm4 code, which consists of the optimizations presented in [ABCG20,BKS19]. Later work [AHKS22] presents a faster Kyber implementation, including a high-speed and stack-friendly version.

Performance of the Polynomial Arithmetic
The cycle counts of the core polynomial arithmetic in Kyber and NTTRU are presented in Table 1. Using the improved Plantard arithmetic, the proposed Kyber implementation based on [ABCG20] achieves 20.24% and 16.92% speed-ups for NTT and INTT, respectively. The speed-optimized implementation of [AHKS22] trades the speed with extra stack usage while implementing the matrix-vector multiplication and base multiplication, which explains their two times faster base multiplication. Besides, the coefficients of their matrix-vector multiplication lie in [−q, q], which reduces one modular reduction of 128 coefficients at the beginning of their INTT implementation. On the contrary, the matrix-vector multiplication of their stack-friendly implementation produces coefficients smaller than kq, which requires extra modular reductions in INTT and results in a slower INTT implementation. After applying the improved Plantard arithmetic in their stack-friendly implementation, we halve the coefficient size of the matrix-vector multiplication down to 1 2 kq; then, three different lazy reduction strategies are adopted for the INTT in Kyber512, Kyber768, and Kyber1024, respectively. Overall, the proposed implementation based on the stackfriendly code of [AHKS22] achieves 25.02% speed-up for NTT and 20.84%/18.56%/17.97% speed-ups for the INTT in Kyber512/Kyber768/Kyber1024, respectively. Our INTT implementation is also 14.38%/11.92%/11.28% faster than their speed-optimized INTT implementation in Kyber512/Kyber768/Kyber1024, respectively. All these speed-ups mainly come from the efficient Plantard multiplication by a constant and better lazy reduction strategies. The results clearly reveal the improved Plantard arithmetic's benefits to the NTT/INTT on Cortex-M4. The base multiplication of Kyber consists of modular multiplication by a constant and modular multiplication of two variables, which increases the register usage pressure and requires extra cycles to handle this problem.
The polynomial arithmetic implementation of NTTRU is similar to Kyber. By fully utilizing the SIMD instructions, the double butterflies, double modular reduction over packed coefficients, double base multiplication, and double base inversion are first presented by applying the improved Plantard arithmetic. The assembly implementation of NTTRU obtains excellent speed-ups compared with its reference implementation in [LS19]. Table 1 shows that the speed-ups for NTT, INTT, base multiplication, and base inversion are 83.21%, 78.64%, 76.40%, and 59.34%, respectively. The speed-ups for NTT and INTT mainly come from the efficient Plantard multiplication by a constant, better lazy reduction techniques, and parallel implementation utilizing the SIMD instructions on Cortex-M4. The speed-up for base inversion is relatively smaller than others because most of the modular multiplications in this operation are implemented as the modular multiplications of two variables. Therefore, the efficient Plantard multiplication by a constant (Algorithm 11) can not be applied to base inversion. Table 2 shows the cycle counts and stack usage of the KEM protocols, namely key generation (KeyGen), encapsulation (Encaps), and decapsulation (Decaps). The Kyber implementation based on either [ABCG20] or the stack-friendly code of [AHKS22] achieves better speed performance than its Montgomery-based counterpart with the same stack usage. For the Kyber implementation based on [ABCG20], the speed-ups for KeyGen, Encaps, and Decaps are 1.27%-1.76%, 0.73%-1.09%, and 1.32%-1.78%, respectively. As for the Kyber implementation based on the stack-friendly code in [AHKS22], we achieve speedups of 1.76%-2.09%, 1.13%-1.50%, and 1.78%-2.48% for KeyGen, Encaps, and Decaps, respectively. It is worth noting that our implementation based on the stack-friendly version of [AHKS22] outperforms their high-speed version with approximately half of the stack. If the stack usage is not one of the primary improved goals, one can also integrate the improved Plantard arithmetic into their speed-optimized implementation for better efficiency. As stated in previous work [KRSS19], the dominance of the hashing functions of Kyber will reduce the overall effect of the optimized polynomial arithmetic, which explains the relatively small speed-ups for the KEM protocols of Kyber.

Performance of Schemes
As the first assembly NTTRU implementation on Cortex-M4, we obtain 49.24%, 45.01%, and 54.56% speed-ups for KeyGen, Encaps, and Decaps compared with its reference implementation. The excellent speed-ups mainly come from the highly optimized assembly implementation of NTT, INTT, base multiplication, and base inversion implemented with the improved Plantard arithmetic. It is obvious that NTTRU outperforms every variant of Kyber by a large margin in terms of speed on Cortex-M4. Compared with the fastest Kyber512, NTTRU provides 37.91%, 55.03%, 46.19% faster KeyGen, Encaps, and Decaps with approximately 3.45∼4.05 times larger stack usage, respectively. In sum, although it is not a candidate for the NIST PQC competition, its efficiency might make it suitable for some specific application scenarios or platforms. This work 267k 237k 254k 9 372 7 452 8 816 a Implementation based on [ABCG20], b Implementation based on the stack-friendly code of [AHKS22].