Masked Accelerators and Instruction Set Extensions for Post-Quantum Cryptography

Side-channel attacks can break mathematically secure cryptographic systems, making them a major concern in applied cryptography. While the cryptanalysis and security evaluation of Post-Quantum Cryptography (PQC) have already received increasing research effort, a cost analysis of efficient side-channel countermeasures is still lacking. In this work, we propose a masked HW/SW codesign of the NIST PQC finalists Kyber and Saber, suitable for their different characteristics. Among others, we present a novel masked ciphertext compression algorithm for non-power-of-two moduli. To accelerate linear performance bottlenecks, we developed a generic Number Theoretic Transform (NTT) multiplier, which, in contrast to previously published accelerators, is also efficient and suitable for schemes not based on the NTT. For the critical non-linear operations, masked HW accelerators were developed, allowing a secure execution using RISC-V instruction set extensions. With the proposed design, we achieved a cycle count of K:214k/E:298k/D:313k for Kyber and K:233k/E:312k/D:351k for Saber with NIST Level III parameter sets. For the same parameter sets, the masking overhead for the first-order secure decapsulation operation, including randomness generation, is a factor of 4.48 for Kyber (D:1403k) and 2.60 for Saber (D:915k).


Introduction
Rapid progress in the area of quantum computers drives the need for new cryptographic algorithms that resist attacks using quantum computers. While classical public-key cryptography, such as RSA and Elliptic Curve Cryptography (ECC), will be broken by a large-scale quantum computer, Post-Quantum Cryptography (PQC) refers to a set of algorithms that are believed to be secure against cryptanalytic attacks using a quantum computer. To accelerate the transition from classical to quantum-secure cryptography, the National Institute of Standards and Technology (NIST) started a standardization process [Nat16] and recently selected seven algorithms as finalists and eight alternate candidates [AASA + 20]. Out of the seven finalists, five schemes are based on the hardness of structured lattice problems. Lattice-based cryptography has become one of the most important PQC categories, as it is characterized by very high performance and relatively small ciphertext and key sizes.
In recent years, there has been a strong focus on efficient implementations of PQC.

Preliminaries

Module Learning With Errors and Module Learning with Rounding
The NIST PQC finalists Kyber and Saber are based on the Module Learning with Errors (MLWE) and Module Learning with Rounding (MLWR) problems, respectively. The MLWE and MLWR problems are both variants of the Ring Learning with Errors (RLWE) problem [LPR10].
Let ⌊x⌋ denote the flooring operation, i.e. rounding towards negative infinity. The rounding operation ⌈x⌋ rounds towards the nearest integer, with ties being rounded up; in other words, it holds that ⌈x⌋ = ⌊x + 0.5⌋. We also reserve the notations ⌊x⌋_f and ⌈x⌋_f, which denote flooring (resp. rounding) x to f fractional digits. We reserve bold notation for matrices and vectors (of polynomials).
Let R_q = Z_q[x]/(φ(x)) be a polynomial ring with the integer modulus q and the cyclotomic polynomial φ(x). An MLWE instance is defined by (A, b = A · s + e), with the public matrix A ∈ R_q^{k1×k2} sampled from a uniform distribution U, the secret s ∈ R_q^{k2} sampled from a binomial distribution Ψ_{η1} with parameter η_1, and the error e ∈ R_q^{k1} sampled from Ψ_{η2} with parameter η_2. In contrast, an MLWR instance is defined by (A, b = ⌈(p/q) · (A · s)⌋), replacing the error by a deterministic rounding function that scales the product by p/q and rounds the result to the nearest integer modulo p. As it is known to be a hard problem to distinguish MLWE/MLWR samples from a uniform sample pair and to recover the secret from b, these samples are well suited to build cryptographic schemes.
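As a toy illustration of these definitions (the dimensions, schoolbook arithmetic, and helper names below are ours, not the schemes' actual parameters), an MLWE instance and the MLWR rounding can be sketched as:

```python
import random

def poly_mul(a, b, q, n):
    """Schoolbook multiplication in Z_q[x]/(x^n + 1) (negacyclic reduction)."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = -1 if i + j >= n else 1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % q
    return c

def cbd(eta):
    """Centered binomial sample in [-eta, eta], emulating Psi_eta."""
    bits = [random.getrandbits(1) for _ in range(2 * eta)]
    return sum(bits[:eta]) - sum(bits[eta:])

def mlwe_instance(n=8, q=3329, k=2, eta=2):
    """Toy MLWE instance (A, b = A*s + e) with a k x k public matrix."""
    A = [[[random.randrange(q) for _ in range(n)] for _ in range(k)] for _ in range(k)]
    s = [[cbd(eta) % q for _ in range(n)] for _ in range(k)]
    e = [[cbd(eta) % q for _ in range(n)] for _ in range(k)]
    b = []
    for i in range(k):
        acc = [0] * n
        for j in range(k):
            acc = [(x + y) % q for x, y in zip(acc, poly_mul(A[i][j], s[j], q, n))]
        b.append([(x + y) % q for x, y in zip(acc, e[i])])
    return A, s, e, b

def mlwr_round(x, q, p):
    """Deterministic MLWR rounding: round((p/q) * x) mod p, for even q."""
    return ((p * x + q // 2) // q) % p
```

For MLWR, `mlwr_round` would be applied coefficient-wise to A · s instead of adding the error e.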

Masking
Masking [CJRR99] is a well-known countermeasure against SCA. It splits a secret variable into multiple parts called shares. A first-order masking uses two shares, and aims to protect against SCA that extract information from the first-order statistical moment. The algorithm is executed on these shares individually to hide any power consumption that would be correlated with the original secret.
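A minimal sketch of this principle, assuming Kyber's modulus as an example and using hypothetical helper names:

```python
import random

q = 3329  # example modulus (Kyber's prime)

def share(x):
    """First-order arithmetic masking: x = (A0 + A1) mod q."""
    a1 = random.randrange(q)
    return ((x - a1) % q, a1)

def unshare(xs):
    return (xs[0] + xs[1]) % q

# Linear operations are executed on each share individually, so the two
# shares are never combined and no intermediate value correlates with the
# unmasked secret.
def masked_add(xs, ys):
    return ((xs[0] + ys[0]) % q, (xs[1] + ys[1]) % q)

def masked_scale(c, xs):
    """Multiplication by an unmasked constant c."""
    return ((c * xs[0]) % q, (c * xs[1]) % q)
```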
As we deal with matrices and vectors of polynomials, which are further split into shares, we use several separate indices. To index a matrix or a vector, we use square brackets.

Threshold Implementations (TI). TI is an effective method to prevent SCA and leakages caused by glitches [NRR06]. The concept is based on multi-party computation. For a TI, the following properties must hold: correctness (the result after the computations remains correct), non-completeness (the partial functions and computations are independent of at least one input share), and uniformity (input and output shares are uniformly distributed). For a security order of d and a function of algebraic degree t, the number of input shares must be s ≥ td + 1. Thus, first-order TIs require a minimum of three input shares for non-linear functions.
Domain Oriented Masking (DOM). DOM is another masking method that provides side-channel resistance and glitch protection [GMK16]. In contrast to the function-oriented nature of TI, DOM operates on share domains. Operations that process shares from a single domain are uncritical as they can only leak information from one particular share domain. Without the information of the remaining domains, an attacker gains no advantage. Nonlinear operations that combine shares from different domains require additional randomness to refresh the cross-domain operations. Registers in the cross-domain paths make sure that the terms are refreshed before being combined to the resulting shares and thus, prevent glitches. The first-order DOM representation of an AND-gate is an important example for DOM [GMK16].
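A first-order DOM AND gate, following the structure described above, can be modeled at the bit level (the model and names are ours; the hardware registers are only noted in comments):

```python
from itertools import product

def dom_and(x0, x1, y0, y1, r):
    """First-order DOM AND on single-bit Boolean shares, x = x0 ^ x1.
    The inner-domain terms x0&y0 and x1&y1 stay within one share domain.
    The cross-domain terms x0&y1 and x1&y0 are refreshed with the fresh
    random bit r; in hardware, a register sits after each refreshed term
    before it is compressed into the output share, preventing glitches."""
    z0 = (x0 & y0) ^ ((x0 & y1) ^ r)
    z1 = (x1 & y1) ^ ((x1 & y0) ^ r)
    return z0, z1
```

An exhaustive check over all share splits and randomness confirms correctness: z0 ^ z1 always equals (x0 ^ x1) & (y0 ^ y1).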

Horizontal Attacks on Masked Implementations
The aim of this work is to present novel masked accelerators and to compare Saber and Kyber from a masking perspective. The algorithms outlined in Section 2 aim to protect our implementation against first-order DPA. Apart from the implemented masked equality test, these components extend readily to higher-order masking, which secures against higher-order DPA.
Besides DPA, other side-channel attack vectors must be considered as well. In particular, there is a broad category of attacks that analyze side-channel traces horizontally. These attacks are able to defeat masking countermeasures because the leakage of both corresponding shares is already present in a single horizontal trace. Single-trace template attacks against masked NTT software implementations have been shown in [PPM17,PP19] and against Keccak in [KPP20]. These works build on Soft-Analytical Side-Channel Attacks (SASCA) [VCGS14], which feed the output of a template attack into a graph representation and apply belief propagation. This method allows retrieving correct results even when a single trace alone would not be sufficient due to, e.g., a low-SNR measurement. In contrast to the previously described attacks, where shares are retrieved independently, [NDGJ21] uses a deep-learning model to directly combine leakage of the individual shares horizontally. They show an attack on the masked Saber implementation of [BDK + 21], recovering the secret key in as few as 16 traces.
Increasing the masking order provides only limited extra security against these types of attacks, since each share is still present horizontally in the trace. Effective countermeasures against such attacks are hiding techniques, especially shuffling, as they break the temporal localization of leakage within a trace. As these countermeasures are usually less expensive than masking and can be implemented similarly for Kyber and Saber, we leave their evaluation for future work. An evaluation of such hiding countermeasures for R-LWE schemes was conducted in [OSPG18]; the authors reported a 1% overhead on top of the masked design. In [RPBC20], the authors evaluate different shuffling strategies for the NTT as a countermeasure against single-trace attacks and report overheads ranging from 181% to 356% compared to an unprotected NTT software implementation.
Hiding is an interesting field of research for an open-source platform like RISC-V because countermeasures can be integrated in hardware (within a dedicated accelerator) or in software. Hardware implementations have the advantage that multiple shares can be processed simultaneously. This way, the trace can no longer be partitioned into each of the shares. Additionally, in both software and hardware, shuffling countermeasures can be extended with blinding techniques [Saa18, ZBT19, HP21] to further increase the noise levels. In our design, the integration of hiding techniques might require further pre- and post-processing steps and hardware modifications, e.g., changes in address units. A detailed analysis of the integration of hiding techniques is outside the scope of this work.

Kyber and Saber Decapsulation
Both Kyber and Saber include a CPA-secure encryption scheme, from which they build a CCA-secure KEM. Since the plain encryption scheme can already be broken without DPA using CCA-style attacks, we choose to mask the CCA-secure KEM. Moreover, DPA typically requires a large number of collected traces to be effective. The CCA-secure decapsulation is the only feasible target, since this constitutes the only operation where multiple traces can be collected for the same long-term secret key. As such, we focus our masking efforts on the decapsulation of Kyber and Saber.
In some scenarios, key generation or encapsulation might also need to be protected against SCA. However, in that case, the adversary usually has access to only a single trace from which to retrieve either the long-term secret key or the ephemeral session key. Attacks that target these operations therefore typically fall into the category of horizontal attacks. As mentioned before, we leave a detailed treatment of hiding techniques, and consequently secure implementations of key generation and encapsulation, as future work. Masking techniques cannot fully protect against this horizontal type of attack, but they can still be used to harden the implementation. Both key generation and encapsulation require the same primitive operations as decapsulation, and our masking techniques would be equally applicable to these routines.
In Algorithms 1-6, we recall the decapsulation of Kyber and Saber, which uses the CPA-secure encryption and decryption as subroutines. We use a simplified notation that highlights both the similarities and differences between the two schemes. The listings use common symbols and operators; they hide the encodings into byte arrays, and they abstract away the various transformations into and out of the NTT domain. We note that Kyber uses a prime modulus q = 3329, whereas Saber chooses power-of-two moduli q = 2^13 and p = 2^10. For a full description of Kyber and Saber, we refer to their respective round 3 specification documents [SAB + 20, DKR + 20]. Both Kyber and Saber use a variety of symmetric primitives, all of which are based on the SHA3 standard: the hash functions G and H, an extendable output function XOF, and a key-derivation function KDF.

Masking Kyber and Saber
In this section, we describe the algorithms and methods necessary to create masked implementations of Kyber and Saber. These algorithms define the hardware architectures of our secure accelerators. Our masked implementations of the decapsulation operation in Saber and Kyber are illustrated in Figures 1 and 2, respectively.
As MLWE/MLWR-based schemes, Saber and Kyber use polynomial arithmetic as their main computational building block. Linear operations, such as ring multiplications with an unmasked input, additions, and subtractions, can be duplicated and performed on each arithmetic share individually. As a result, expensive operations such as polynomial multiplications become increasingly attractive to accelerate in hardware. Our developed generic hardware accelerator for polynomial arithmetic is presented in Section 3. The polynomial multiplications that we accelerate are highlighted in yellow in Figures 1 and 2.
Non-linear operations are more complex to mask. These operations combine information from both of the shares, and special care must be taken such that they do not jointly leak the secret unmasked value. In Figures 1 and 2, these operations are highlighted in blue. Typically, these operations are expressed in terms of bit-operations, and it is often more natural to fall back to methods based on Boolean masking. The combination of both arithmetic and Boolean masking in Saber and Kyber requires the use of mask conversion algorithms to switch from either Boolean to Arithmetic (B2A) or Arithmetic to Boolean (A2B) masking.
A masked implementation of Saber decapsulation targeting the ARM Cortex-M4 has been proposed in [BDK + 21]. The authors of [BDK + 21] show that Saber is relatively efficient to mask, and argue that this is due to Saber's choice for a power-of-two modulus and the deterministic rounding of MLWR. For the non-linear masked routines, they use a masked Keccak implementation [BDPVA10], a masked binomial sampler [SPOG19], and a masked comparison algorithm [OSPG18], and these exact same methods can be reused for a masked implementation of Kyber. We integrate them into our secure masked accelerators and discuss their hardware architectures in Section 4. To implement B2A and A2B conversions, the authors of [BDK + 21] adopt an algorithm due to Goubin [Gou01] and a table-based algorithm [CT03], respectively. Both of these algorithms are specialized for power-of-two moduli and can therefore not directly be reused for Kyber. Motivated by this observation, we choose to implement different B2A and A2B techniques. In the remainder of this section, we first outline the B2A and A2B conversions that we implement, and we subsequently use them to propose a novel method for masked ciphertext compression.

B2A and A2B Conversions
B2A and A2B conversions allow to securely convert between an arithmetic masking x = A_0 + A_1 and a Boolean masking x = B_0 ⊕ B_1. These methods may choose to keep a single random mask A_1 = B_1 = R, in which case the conversions compute either B_0 = (A_0 + R) ⊕ R (A2B) or A_0 = (B_0 ⊕ R) − R (B2A). A2B and B2A conversion methods were first proposed by Goubin [Gou01]. In software implementations, A2B conversions can efficiently be implemented using table-based methods [CT03, Deb12, VBDV21]. This is the approach taken in [OSPG18] and [BDK + 21]. The drawbacks of table-based methods are that they do not extend to higher-order security, require work-arounds to handle prime moduli [OSPG18], and that they are relatively difficult to translate to a hardware implementation that also resists glitches. B2A conversions, on the other hand, are typically not table-based. In [BDK + 21], Goubin's B2A method is used, which is specialized for power-of-two moduli. Some ad-hoc methods for prime-modulus B2A_q and A2B_q conversion were proposed in [OSPG18], and subsequently formalized in [BBE + 18] and [SPOG19].
In contrast to the previous masked implementations [OSPG18] and [BDK + 21], we employ A2B and B2A conversions that are based on secure masked arithmetic addition over Boolean shares (SecAdd) [CGV14]. Our reasoning is fourfold. First, since both A2B and B2A conversion can be expressed in terms of SecAdd, we are able to accelerate both operations with a single hardware block. Second, in [BBE + 18] this secure adder was extended to work with prime moduli. SecAdd_q essentially makes two calls to SecAdd, such that B2A_q and A2B_q can additionally be accelerated with the same SecAdd hardware. Third, SecAdd only requires the masked implementation of a binary adder, and efficient TI implementations of ripple-carry or Kogge-Stone variants have already been proposed [SMG15]. Finally, the A2B and B2A approaches based on secure addition are readily extensible to higher-order security, which is not the case for table-based A2B or Goubin's B2A algorithms. We now describe the conversion based on SecAdd in detail. Our focus is on univariate, first-order side-channel security, and wherever possible we simplify the algorithms to focus on this case. For a general description focusing on arbitrary orders and (multivariate) composability, we refer to the original works [CGV14, BBE + 18].
SecAdd takes as inputs the Boolean maskings x_{0:1} = (x_0, x_1) and y_{0:1} = (y_0, y_1) and outputs a Boolean masking s_{0:1} = (s_0, s_1) such that (s_0 ⊕ s_1) = (x_0 ⊕ x_1) + (y_0 ⊕ y_1). SecAdd_q [BBE + 18] can be constructed from SecAdd by securely computing a second sum (s′_0 ⊕ s′_1) = (s_0 ⊕ s_1) + (q_0 ⊕ q_1), where (q_0, q_1) is a Boolean masking of −q in two's complement form. If x + y ≥ q, then s′ = (x + y − q) is the correct sum (x + y) mod q, and it also holds that s′ ≥ 0. Alternatively, if x + y < q, then s = (x + y) is the correct sum and s′ < 0. Since s′ is negative in one case and non-negative in the other, the masked sign bit c = sign(s′) can be used to select the correct sum. Having a distinct sign bit requires that s′ is computed to at least w = ⌈log_2(q)⌉ + 1 bits, i.e. one bit larger than the initial masks. SecAdd_q is illustrated in Algorithm 7. We propose a new simplified version of SecAdd_q, which assumes that the input shares already satisfy (x_0 ⊕ x_1) + (y_0 ⊕ y_1) = x + y − q in two's complement. In this case, it is possible to directly compute s′ = x + y − q through SecAdd(x_{0:1}, y_{0:1}). If s′ < 0, q must be added again to find the correct sum. This time, rather than using c = sign(s′) to multiplex between s′ and s′ + q, we compute the correct sum as s = s′ + c · q. This is easily possible, since the multiplication with q distributes over the masking c_0 ⊕ c_1 = c, i.e. q · c_0 ⊕ q · c_1 = q · c. Our simplified SecAdd_q routine is shown in Algorithm 8; it avoids the masked multiplexer altogether.
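The following Python model illustrates SecAdd and the simplified SecAdd_q. It is a sketch under our own assumptions: a bit-serial ripple-carry adder built from first-order masked AND gates stands in for the Kogge-Stone SecAdd hardware, the caller performs the two's-complement pre-encoding of x + y − q, and all names are ours:

```python
import random

def sec_and(x, y, r):
    """First-order masked AND on Boolean shares (ISW/DOM style)."""
    return ((x[0] & y[0]) ^ r ^ (x[0] & y[1]),
            (x[1] & y[1]) ^ r ^ (x[1] & y[0]))

def sec_add(x, y, w, rand):
    """SecAdd sketch: Boolean shares of ((x0^x1) + (y0^y1)) mod 2^w."""
    c, s0, s1 = (0, 0), 0, 0
    for i in range(w):
        a = ((x[0] >> i) & 1, (x[1] >> i) & 1)
        b = ((y[0] >> i) & 1, (y[1] >> i) & 1)
        s0 |= (a[0] ^ b[0] ^ c[0]) << i
        s1 |= (a[1] ^ b[1] ^ c[1]) << i
        # carry-out = (a & b) ^ (c & (a ^ b)), computed on shares
        t = sec_and(a, b, rand())
        u = sec_and(c, (a[0] ^ b[0], a[1] ^ b[1]), rand())
        c = (t[0] ^ u[0], t[1] ^ u[1])
    return s0, s1

def sec_add_q_simplified(x, y, q, w, rand):
    """Simplified SecAdd_q sketch: the input sharing is assumed to already
    encode x + y - q in w-bit two's complement."""
    s = sec_add(x, y, w, rand)                    # s' = x + y - q mod 2^w
    c = ((s[0] >> (w - 1)) & 1, (s[1] >> (w - 1)) & 1)  # masked sign bit
    qc = (q * c[0], q * c[1])                     # Boolean sharing of c * q
    return sec_add(s, qc, w, rand)                # s = s' + c * q
```

For Kyber, w = 13 bits suffice, since x + y − q ∈ [−q, q − 2] fits in 13-bit two's complement.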
A2B conversion follows directly from SecAdd. Given an arithmetic masking x = A_0 + A_1, the secure addition of A_0 and A_1 with outputs in Boolean masked form is exactly an A2B conversion. A2B and A2B_q based on this idea are illustrated in Algorithms 9 and 10. In these algorithms, the shares A_0 and A_1 are first themselves shared as a Boolean masking, before being fed into SecAdd. As we hinted at before, we have full control over this initial Boolean masking. Therefore, for A2B_q, we create an initial masking of A_0 + A_1 − q, and use our simplified version of SecAdd_q.
B2A conversion uses a similar idea. Given a Boolean masking x = B_0 ⊕ B_1, the first arithmetic share A_0 is simply sampled randomly. The second arithmetic share can then be computed from the first one by securely computing A_1 = (B_0 ⊕ B_1) − A_0 mod 2^k. As in the A2B case, first a Boolean masking of −A_0 mod 2^k is created, and subsequently this is fed into SecAdd. The result is a Boolean masking of A_1, which can be decoded to find the second share A_1. B2A and B2A_q are illustrated in Algorithms 11 and 12. Again, to utilize our simplified and more efficient version of SecAdd_q, we simply create an initial two's complement Boolean sharing of (−A_0 mod q) − q instead.
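Both conversions can be sketched around the same masked adder. The model below is illustrative and self-contained (a bit-serial ripple-carry adder again stands in for SecAdd, and all names are ours):

```python
import random

def sec_and(x, y, r):
    """First-order masked AND on Boolean shares."""
    return ((x[0] & y[0]) ^ r ^ (x[0] & y[1]),
            (x[1] & y[1]) ^ r ^ (x[1] & y[0]))

def sec_add(x, y, w, rand):
    """Bit-serial masked ripple-carry adder standing in for SecAdd."""
    c, s0, s1 = (0, 0), 0, 0
    for i in range(w):
        a = ((x[0] >> i) & 1, (x[1] >> i) & 1)
        b = ((y[0] >> i) & 1, (y[1] >> i) & 1)
        s0 |= (a[0] ^ b[0] ^ c[0]) << i
        s1 |= (a[1] ^ b[1] ^ c[1]) << i
        t = sec_and(a, b, rand())                          # a & b
        u = sec_and(c, (a[0] ^ b[0], a[1] ^ b[1]), rand()) # c & (a ^ b)
        c = (t[0] ^ u[0], t[1] ^ u[1])
    return s0, s1

def a2b(A0, A1, w, rand):
    """A2B sketch: Boolean-share each arithmetic share, then SecAdd them."""
    r0, r1 = random.getrandbits(w), random.getrandbits(w)
    return sec_add((A0 ^ r0, r0), (A1 ^ r1, r1), w, rand)

def b2a(B, k, rand):
    """B2A sketch: sample A0, Boolean-share -A0 mod 2^k, add it to the input
    masking, and decode the result as A1. Decoding is safe here because
    A1 = x - A0 is uniformly distributed thanks to the random A0."""
    A0 = random.getrandbits(k)
    r = random.getrandbits(k)
    neg = (-A0) & ((1 << k) - 1)
    y = sec_add(B, (neg ^ r, r), k, rand)
    return A0, y[0] ^ y[1]
```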

Masked Compression
Both Saber and Kyber include a compression operation that rounds away some low-order bits of a ring element. In Line 3 of the decryption step, the compression operation is used for message decoding, i.e. mapping (⌈q/2⌋ · m + e) back to m. In the encryption step, Lines 5 and 6, the same operation is used to compress the ciphertext components u and v.
Algorithm 11: B2A [CGV14]. Algorithm 12: B2A_q [BBE + 18].

For Saber, ciphertext compression is inherently tied to the security of its MLWR instance, whereas Kyber initially only compressed the ciphertext components to reduce their size. The Kyber compression function takes an input x ∈ Z_q and outputs an integer in {0, ..., 2^d − 1}, where d < ⌈log_2(q)⌉: Compress_q(x, d) = ⌈(2^d/q) · x⌋ mod 2^d. For Saber, where q = 2^13 is a power of two, Compress_{2^k} can be expressed as a simpler logical shift. In order to round the result instead of flooring, the constants h_1, h_2, or h are added before the shift. Compression must discard some lower-order bits of arithmetically masked ring elements. Discarding these bits is inherently a Boolean operation, and A2B conversion can help to mask this operation effectively. In [BDK + 21], a new technique is proposed to optimize A2B conversion for masked logical shifting in Saber. Compression for prime moduli has so far only been treated in the context of message decoding, in [RRVV15] and [OSPG18]. We first review these existing approaches and show that they do not extend efficiently to ciphertext compression. Subsequently, we outline a novel method to mask Compress_q, which is simple and efficient to implement.

MaskedCompress_{2^k}
For power-of-two moduli, ciphertext compression constitutes a simple rounded logical shift. In an arithmetic sharing (x_msb ∥ x_lsb) = (A_{0,msb} ∥ A_{0,lsb}) + (A_{1,msb} ∥ A_{1,lsb}), this shift needs special consideration, because the sum of the lower bits A_{0,lsb} + A_{1,lsb} might produce a carry that must be added to the upper bits before the lower bits are shifted out. A straightforward solution is to first perform an A2B conversion, since a Boolean masking can be shifted share-wise without carry propagation. The masked Saber implementation of [BDK + 21] optimizes table-based A2B conversion to only compute the carry of the lower bits, rather than a full conversion. This carry is subsequently added to the higher bits, leaving them as an arithmetic sharing. The full procedure is termed A2A conversion.
The A2A optimization also applies to the A2B conversion based on SecAdd. In this setting, when only the carry-out is required, all the summation logic can be pruned from the binary adder. Furthermore, since the carry is only needed at the final position, any carry look-ahead logic can be implemented maximally sparse. However, this optimization would also prevent us from supporting B2A conversion with the same SecAdd hardware. Hence, we implement the simpler solution, i.e. a full A2B conversion and subsequent share-wise logical shift.
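The carry problem, and why Boolean shares avoid it, can be demonstrated numerically (toy bit widths and names are ours):

```python
import random

w, d = 13, 3                  # toy: drop the d = 3 low-order bits of a 13-bit value
mask = (1 << w) - 1
x = 0b1010101010101           # example value, x = 5461

# Arithmetic shares cannot simply be shifted share-wise: the carry produced
# by the low-order bits of A0 + A1 is lost.
A1 = 7
A0 = (x - A1) & mask
naive = ((A0 >> d) + (A1 >> d)) & ((1 << (w - d)) - 1)
# here naive == 681 while x >> d == 682: the carry out of the low bits is missing

# Boolean shares, in contrast, can be shifted share-wise without any carry,
# which is why an A2B conversion before the shift solves the problem.
B1 = random.getrandbits(w)
B0 = x ^ B1
shifted = (B0 >> d) ^ (B1 >> d)
```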

MaskedDecode_q
Masked decoders have been proposed in [RRVV15] and [OSPG18]. Rather than dividing by the modulus, the decoding step is expressed as an interval comparison, i.e. the decoded bit equals 1 if and only if x lies in an interval of width q/2 around q/2. For an arithmetic masking x = A_0 + A_1, [RRVV15] proposes a probabilistic decoder that uses information on the quadrants of A_0 and A_1 in a masked table lookup. A different approach is taken in [OSPG18], where a series of A2B-related transformations are used to create a masked decoder. Lacking an existing A2B_q transform, the authors propose another conversion, TransformPower2, that transforms an arithmetic masking mod q to a masking mod 2^k. Nevertheless, we do have an A2B_q conversion available, and use it to simplify the masked decoder of [OSPG18]. The resulting process is shown in Figure 3. The final result is a Boolean masking where the most significant bit is a masking of the decoded result: (m_0 ⊕ m_1) = m.

MaskedCompress_q
Unfortunately, the techniques of masked decoding do not extend to masked compression. When dealing with 2 intervals of width q/2, it is possible to position their boundary exactly at a power of two as in Figure 3. However, already for d = 2 we have 4 intervals of width q/4, and this technique is no longer applicable.
We propose a substantially different masked compression technique. Rather than expressing it as an interval comparison, we analyze and mask the required division by the modulus q. The key idea is simple. First, we observe that the compression tolerates an approximate quotient x̃ ≈ (2^d/q) · x. In other words, ⌈(2^d/q) · x + e⌋ mod 2^d remains correct for a small bounded error e. The reason for this is apparent if we express (2^d/q) · x as a binary fraction: its fractional part equals (2^d · x mod q)/q, and it is strictly limited to the set of values {0/q, ..., (q−1)/q}. This fractional part is never exactly 0.5; instead, the edge-case values are ⌊q/2⌋/q = 1664/3329 and ⌈q/2⌉/q = 1665/3329, which should be rounded down and up, respectively. These values are still rounded correctly, even when subject to a small error −1/(2q) ≤ e < 1/(2q). As a result, the approximate quotient (2^d/q) · x + e is rounded correctly, given the same bound on e.
Our simple but crucial observation to build MaskedCompress_q is that we can compute such an approximate quotient individually from the shares of x = x_0 + x_1 mod q, using only finite-precision arithmetic. For example, using integer division, we can compute the truncated share-wise quotients x̃_i = ⌊(2^d/q) · x_i⌋_f = ⌊2^{d+f} · x_i / q⌋ / 2^f, (7) which together form a strict underestimate of the real quotient (2^d/q) · x mod 2^d, i.e. e < 0. More generally, we can compute rounded share-wise quotients x̃_i = ⌈(2^d/q) · x_i⌋_f, (8) with a bounded error e. In both cases, the rounding error e can be arbitrarily lowered by increasing the number of fractional bits f. As a result, for an appropriately large choice of f that ensures −1/(2q) ≤ e < 1/(2q), the share-wise 'fixed-point' quotients of Equations 7 and 8 can be used to correctly retrieve the output of Compress_q.
We now analyze the requirements on f in detail. The share-wise quotients of Equations 7 and 8 consist of d integer and f fractional bits, with the remaining bits being truncated or rounded, respectively. When the quotients are truncated as in Equation 7, the share-wise quotients produce a strict underestimate of the real quotient (2^d/q) · x. This underestimate has the effect of truncating the actual quotient (2^d/q) · x, and possibly omits a carry-in from the additive shares at the f-th fractional bit. Nevertheless, this underestimate can still be rounded correctly if f is chosen such that fractional values larger than 0.5 do not underflow below 0.5. Specifically, when (2^d/q) · x must be rounded up, its fractional part exceeds 0.5 by at least 1/(2q), and the truncation error must not cancel this margin; for Kyber, this holds for f ≥ 13.^4 We can similarly analyze the rounded quotients of Equation 8. By rounding at the (f + 1)-th binary digit, the worst-case rounding error is |e_i| < 1/2^{f+1} for each share-wise quotient.^5 The total rounding error for two shares therefore remains strictly bounded by |e| < 1/2^f. To satisfy the bound −1/(2q) ≤ e < 1/(2q), it suffices that f > log_2(2q), which again results in f ≥ 13 for Kyber. As truncation is easier to implement than rounding and results in the same bound, we choose to implement it in our algorithm.
After computing share-wise quotients with a certain precision f, we obtain a 'fixed-point' arithmetic sharing (x̃_0, x̃_1) with d integer bits and f fractional bits. For an appropriately large choice of f, this 'fixed-point' arithmetic sharing allows us to recover the output of Compress_q(x, d) = ⌊x̃_0 + x̃_1 + 0.5⌋ mod 2^d. Somewhat surprisingly, we have reduced MaskedCompress_q exactly to the problem of MaskedCompress_{2^k}. The final output of Compress_q(x, d) constitutes the upper d bits of the (d + f)-bit arithmetic sharing (x̃_0, x̃_1 + 0.5), which we compute with a (d + f)-bit A2B conversion and subsequent share-wise logical shift. As before, the A2A conversion of [BDK + 21] is applicable to optimize the computation of the carry-in, but prevents unified hardware in our case. We illustrate our MaskedCompress_q routine in Algorithm 13 and also graphically in Figure 4, using only integer arithmetic and flooring divisions.^6 Its simplicity is apparent, requiring only a single A2B call that combines information from the shares. For higher-order security, f must be chosen to tolerate rounding errors from an increasing number of shares. As a result, the required bit-size of the A2B grows logarithmically with the number of shares. For first-order security with f = 13, the largest value for d is d_u = 11 in Kyber-1024, requiring a 24-bit power-of-two A2B conversion. Using our novel MaskedCompress_q algorithm, masked Kyber does not require any actual A2B_q conversion.

^4 Since 1665/3329 = 0b0.1000000000001... as a binary fraction, an underflow is allowed at the 13-th position.
^5 More precisely, |e_i| ≤ (⌊q/2⌋/q) · 2^{−f}.
^6 As an implementational note, for d > 7, 2^{d+f} · x_i can grow larger than 32 bits. The result must be placed in a uint64_t, and special care must be exercised that the division of a uint64_t by the constant q compiles to a constant-time operation.
Figure 4: MaskedCompress_q
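A reference model of the share-wise quotient idea (Equation 7) can be checked against an unmasked Compress_q. This sketch adds the shares directly; in the actual algorithm, that final addition happens inside the (d + f)-bit A2B conversion, and only shares of the top d bits are kept:

```python
import random

q, d, f = 3329, 4, 13          # Kyber's modulus; f >= 13 fractional bits suffice

def compress(x, d):
    """Unmasked reference: Compress_q(x, d) = round((2^d / q) * x) mod 2^d."""
    return ((2 ** (d + 1) * x + q) // (2 * q)) % 2 ** d

def masked_compress_quotients(x0, x1):
    """Share-wise truncated fixed-point quotients (Equation 7).
    Returns a (d+f)-bit arithmetic sharing; once the shares are added
    (inside A2B in the real design), the top d bits equal Compress_q(x, d)."""
    y0 = (2 ** (d + f) * x0) // q
    y1 = (2 ** (d + f) * x1) // q + 2 ** (f - 1)   # fold in the +0.5 constant
    return y0 % 2 ** (d + f), y1 % 2 ** (d + f)
```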

Masked Equality Test
At the end of the decapsulation, the re-encrypted ciphertext c′ = c′_0 ⊕ c′_1 must be checked for equality against the input ciphertext c. The end result of the check is no longer sensitive, but the re-encrypted ciphertext itself must not be unmasked.
Both first-order and higher-order secure algorithms for masked equality testing have been proposed, in [OSPG18] and [BPO + 20], respectively. In [OSPG18], the main idea is to use an additional hashing step and check whether H(c ⊕ c′_0) equals H(c′_1). The collision resistance of H guarantees that the two hashes are only equal for a valid ciphertext, and the pre-image resistance ensures that the hashes no longer contain exploitable information about c′.
Recently, it was shown that both the [OSPG18] and [BPO + 20] methods leak some information on c′, and that this information can be used to significantly decrease the security of the underlying MLWE instance [BDH + 21]. The method of [OSPG18] contains a flaw because it checks the equality of the two masked ciphertext components of c′ separately. The individual equality results are still sensitive, which was already noted by the authors of the masked Saber implementation [BDK + 21]. Luckily, the method permits a simple fix, by performing the test atomically for both ciphertext components. We take the same approach as [BDK + 21] and perform the hash atomically on the concatenation of both ciphertext components, which prevents the leakage present in [OSPG18] [BDH + 21]. By implementing the masked equality test from [OSPG18], we limit this component to first-order side-channel security. While we prefer to use methods that extend to higher orders, the masked equality test of [BPO + 20] is not applicable to Saber or Kyber due to the ciphertext compression operation. Generalizing this component so that it is extensible to higher masking orders is left as future work.
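Assuming SHA3-256 plays the role of H and modeling ciphertexts as byte strings (names are ours), the fixed check can be sketched as:

```python
import hashlib

def masked_equal(c, c0, c1):
    """Masked ciphertext comparison in the style of [OSPG18] with the atomic
    fix of [BDK + 21]: hash the full (concatenated) ciphertext at once and
    test H(c ^ c'_0) == H(c'_1). The hashes match iff c == c'_0 ^ c'_1,
    while neither hash input reveals the unmasked re-encryption c'.
    c is the input ciphertext; c0, c1 are the Boolean shares of c'."""
    lhs = bytes(a ^ b for a, b in zip(c, c0))
    return hashlib.sha3_256(lhs).digest() == hashlib.sha3_256(bytes(c1)).digest()
```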

Comparing Masking for Kyber vs Saber
In Figures 1 and 2, it can be seen that Kyber and Saber have highly similar masked architectures. The difference between MLWE and MLWR is apparent in the extra XOF, Ψ_{η2}, and B2A_q calls required to sample the error terms e_1 and e_2 for Kyber. B2A_q has roughly twice the complexity of B2A, essentially because SecAdd_q makes two calls to SecAdd. Kyber further needs an additional B2A_q conversion to convert the Boolean masking m_{0:1} back to an arithmetic sharing mod q. This operation is 'free' for Saber, since the required share-wise left-shift (p/2) · m_{0:1} already has the added effect of implicitly converting to an arithmetic sharing mod p. Using our new MaskedCompress_q algorithm, masked ciphertext compression is remarkably similar for Saber and Kyber. For both, the involved non-linear operation is a power-of-two A2B conversion, where only the high-order bits of the resulting Boolean masking must be kept. However, Saber only requires a 13-bit A2B conversion, whereas Kyber-1024 requires a 24-bit conversion. Moreover, the conversion width for Kyber grows (logarithmically) with the number of shares.
Specialized hardware could be used to favor masking methods for either Saber or Kyber. In this work, our aim is to be generic and support masking for both schemes with identical hardware. Therefore, in Section 4.3, we implement a 32-bit Kogge-Stone SecAdd that supports A2B conversion for both Saber and Kyber. Especially Saber could benefit from a smaller and faster adder, or from A2B/B2A algorithms specialized for power-of-two moduli [BCZ18]. In Section 4.2, we describe a generic hardware architecture for masked binomial sampling. This architecture could in turn be optimized for Kyber, which uses smaller η than Saber.

HW Accelerators for Linear Operations
In a masked setting, all polynomial arithmetic is duplicated, so a hardware accelerator for these operations becomes even more important. As the goal of this work is to mask Kyber and Saber, we build a unified hardware accelerator that efficiently supports polynomial arithmetic for both schemes. A common hardware accelerator for arithmetic operations in Kyber and Saber needs to cover a wide range of different requirements. In this section, we present a novel NTT-based hardware accelerator that meets these requirements. Due to the generic design strategy, the developed architecture automatically covers a variety of other lattice-based schemes (see Table 1).

Number Theoretic Transform (NTT)
The NTT is an efficient method to reduce the complexity of the polynomial multiplication from O(n²) to O(n log₂ n). It is a variant of the Fast Fourier Transform (FFT) with operations in the field Z_q instead of over the complex numbers.
Let a, s ∈ Z_q[x]/φ(x) be two ring polynomials of degree at most n − 1. Then the polynomial multiplication using the forward and inverse NTT can be computed as c = INVNTT(NTT(a) ∘ NTT(s)), where ∘ denotes the coefficient-wise multiplication.
In lattice-based cryptography, the product of a polynomial multiplication of length 2n is usually reduced by the cyclotomic polynomial φ(x) (frequently x^n − 1 or x^n + 1). The polynomial reduction by x^n − 1 is also referred to as positive wrapped convolution and the reduction by x^n + 1 as negative wrapped convolution. Let ω_n ∈ Z_q be a primitive n-th root of unity with ω_n^n = 1 mod q and ω_n^i ≠ 1 mod q for all i ∈ [1, n − 1]. The forward transform of the coefficients a_i and the inverse transform of the â_i are computed with

â_i = Σ_{j=0}^{n−1} γ^j · a_j · ω_n^{i·j} mod q  and  a_i = n^{−1} · γ^{−i} · Σ_{j=0}^{n−1} â_j · ω_n^{−i·j} mod q,

where γ is the 2n-th root of unity γ_n for negative wrapped convolutions and γ = 1 for positive wrapped convolutions. With pre- and postprocessing using the powers of γ, a length-2n NTT with zero-padding can be avoided and a length-n NTT is sufficient. Table 1 summarizes polynomial arithmetic parameters used in several lattice-based algorithms. While some schemes already use parameters suitable for the NTT, others choose a prime not suitable for a direct application of the NTT.
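To make the pre-/postprocessing concrete, the following sketch multiplies two polynomials in Z_q[x]/(x^n + 1) using a textbook O(n²) transform; the toy parameters n = 4, q = 17, γ = 9 are our own (chosen so that γ^n ≡ −1 mod q) and not taken from any scheme:

```python
def ntt(a, omega, q):
    # textbook O(n^2) transform: a_hat[i] = sum_j a[j] * omega^(i*j) mod q
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]

def negacyclic_mul(a, s, q, gamma):
    # negative wrapped convolution: pre-/post-multiply by powers of the
    # 2n-th root gamma instead of using a zero-padded length-2n transform
    n = len(a)
    omega = pow(gamma, 2, q)  # n-th root of unity
    a_hat = ntt([a[i] * pow(gamma, i, q) % q for i in range(n)], omega, q)
    s_hat = ntt([s[i] * pow(gamma, i, q) % q for i in range(n)], omega, q)
    c_hat = [x * y % q for x, y in zip(a_hat, s_hat)]
    c = ntt(c_hat, pow(omega, -1, q), q)  # unscaled inverse transform
    n_inv = pow(n, -1, q)
    return [c[i] * n_inv * pow(gamma, -i, q) % q for i in range(n)]

# c = a * s mod (x^4 + 1) over Z_17
print(negacyclic_mul([1, 2, 3, 4], [5, 6, 7, 8], 17, 9))  # -> [12, 15, 2, 9]
```

Swapping the naive transform for a butterfly-based O(n log n) NTT leaves the γ pre-/postprocessing unchanged.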

Design Rationale -NTT
NTT with prime lift. The original prime q can be lifted to any 'NTT friendly' prime q′ > n · q² for an NTT-based polynomial multiplication. The intermediate values and result of the polynomial multiplication have coefficients not larger than n · q². If q′ is set sufficiently large, precision errors caused by the modular arithmetic are avoided [PNPM15]. After polynomial multiplication with the NTT, the coefficients can be reduced by the original prime q. Using signed arithmetic, the maximum absolute value of the coefficients during the computation is n · q²/4 when the coefficients are represented in [−q/2, q/2). Some schemes always multiply large polynomials with small polynomials sampled from the error distribution, allowing to further decrease the value of q′. However, for schemes such as NTRU, large polynomials with coefficients in [0, q) are multiplied with polynomials having the same coefficient range. As in this work all schemes of Table 1 shall be supported by the same hardware architecture and unsigned arithmetic is more suitable for hardware circuits, the rule q′ > n · q² is applied. All NTT-based schemes of Table 1 have primes smaller than 23 bits. To cover all ranges, in this work we develop a flexible Montgomery multiplier for any prime up to 24 bits. For algorithms that are not NTT-based, a lifted prime q′ has to be found that covers the remaining algorithms. To allow an easy reduction, the Solinas prime q′ = 2^39 − 2^12 + 1 = 549755809793 is chosen. For this prime the condition q′ ≡ 1 mod 2n holds (the prime has the form q′ = 2^k·p + 1) and the n-th as well as the 2n-th root of unity exist (e.g., for n ∈ [256, 512, 1024, 2048]).
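The reduction-friendliness of this Solinas prime follows from the identity 2^39 ≡ 2^12 − 1 (mod q′), which allows folding the upper bits of a product back with shifts, additions, and subtractions. A functional sketch (the two-fold schedule below is ours and does not mirror the exact adder/subtractor arrangement of the hardware):

```python
QP = 2**39 - 2**12 + 1  # lifted Solinas prime q' = 549755809793

def solinas_reduce(x):
    # fold the bits above position 39 using 2^39 = 2^12 - 1 (mod q');
    # two folds bring any product of two values below q' under ~2^40
    for _ in range(2):
        hi, lo = x >> 39, x & ((1 << 39) - 1)
        x = lo + (hi << 12) - hi
    while x >= QP:  # final conditional correction
        x -= QP
    return x

assert solinas_reduce(2**39) == 2**12 - 1
assert solinas_reduce((QP - 1) * (QP - 1)) == pow(QP - 1, 2, QP)
```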
Positive and negative wrapped convolution. Choosing γ = 1 or γ = γ_n = √ω_n with ω_n^n = 1 mod q, ω_n^{n/2} = γ_n^n = −1 mod q, and n = 2^k leads to positive and negative wrapped convolutions for NTT-based schemes, respectively. Lifting to a higher prime only works if no reduction errors are introduced during the convolution. Negacyclic convolutions involve negative intermediate results that lead to an erroneous output when reduced by q′. These reductions can be avoided using signed arithmetic. For unsigned arithmetic, polynomial multiplications with polynomials of length n′ = 2n, zero-padding, and a consecutive polynomial reduction by φ(x) can be used. Positive wrapped convolutions can still be realized with an NTT of length n′ = n.
Incomplete NTT. The prime q is usually chosen such that φ(x) can be factored into n polynomials of degree one. This allows the full application of the NTT, and the basecase multiplication of two transformed polynomials corresponds to a simple coefficient-wise multiplication. The concept of the incomplete NTT for lattice-based cryptography was first proposed in [LS19] and a similar concept was later adopted in the second-round Kyber specification. Kyber reduced its prime value (and consequently key and ciphertext sizes) and chose a value where the n-th root of unity exists but not the 2n-th root of unity. This prevents applying a full NTT; only l − 1 layers of the NTT are applied, resulting in n/2 polynomials of degree two.
NTT algorithms. When exploiting symmetry, periodicity, and scale properties of the Fourier transform, the complexity of Equation 17 can be reduced with a divide-and-conquer approach from O(n²) to O(n log₂ n). The two most common methods for splitting a large Fourier transform into smaller pieces are the Cooley-Tukey (CT) [CT65] and the Gentleman-Sande (GS) [GS66] algorithms. The butterfly operation, which is the main operation of these algorithms, consists of simple arithmetic in Z_q. The Cooley-Tukey decimation-in-time (DIT) approach computes x ← x + y·ω and y ← x − y·ω with ω, x, y ∈ Z_q and ω usually a power of ω_n (also known as Twiddle factor). The Gentleman-Sande decimation-in-frequency (DIF) approach computes x ← x + y and y ← (x − y)·ω. Different in-place variants of the Cooley-Tukey and Gentleman-Sande algorithms exist, denoted as NTT^CT_(br→no), NTT^CT_(no→br), NTT^GS_(br→no), and NTT^GS_(no→br), where, e.g., no → br indicates that the input is in normal and the output in bit-reversed order. The bit-reversal can be completely avoided with a combination of the variants NTT^CT_(no→br) and INVNTT^GS_(br→no) [POG15].
As in previous works, we use different algorithms for the forward and inverse NTT to avoid the bit-reversal step, although the bit-reversal operation is simple in hardware. Using a DIT algorithm for the forward transform and a DIF algorithm for the inverse transform has the further advantage that the multiplications by the powers of γ_n can be integrated into the precomputed tables of Twiddle factors.
Algorithms 14 and 15 illustrate the operations for our flexible NTT. Starting with the original NTT/INVNTT algorithms, we modify the algorithms to support an early abort for an incomplete NTT, as required by Kyber. The incomplete NTT can be activated using the early_abort signal. Moreover, we integrated support for either positive or negative wrapped convolutions. The wrapping method can be switched using the negacyclic signal. Thus, all schemes of Table 1 can use the same algorithms. Note that the INVNTT requires a final scaling by n −1 . For NTT-based schemes, the Twiddle table is stored in Montgomery domain in order to make use of a flexible Montgomery multiplier. In negacyclic NTT-based schemes, the Twiddle table contains n (n/2 at early aborts) merged values for the powers of ω n and γ n in bit-reversed order and the same amount of precomputed values for the inverse transform. For schemes with positive wraparound or schemes not based on NTT, n precomputed values of the powers of ω n are stored in the Twiddle table.
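A behavioral model of the forward transform with the early-abort control can be sketched as follows (our simplification: it covers only the negacyclic case with merged γ twiddles and omits the Montgomery domain and the inverse transform):

```python
def bitrev(i, bits):
    # bit-reversal of an index, e.g. bitrev(1, 3) = 4
    return int(format(i, f'0{bits}b')[::-1], 2)

def ntt_ct(a, q, gamma, early_abort=False):
    # in-place Cooley-Tukey DIT NTT, normal -> bit-reversed order, with the
    # powers of the 2n-th root gamma merged into the twiddle factors;
    # early_abort skips the last layer, leaving n/2 degree-1 residue pairs
    a = list(a)
    n = len(a)
    logn = n.bit_length() - 1
    k = 1
    length = n // 2
    while length >= (2 if early_abort else 1):
        for start in range(0, n, 2 * length):
            zeta = pow(gamma, bitrev(k, logn), q)  # merged twiddle
            k += 1
            for j in range(start, start + length):
                t = zeta * a[j + length] % q
                a[j + length] = (a[j] - t) % q
                a[j] = (a[j] + t) % q
        length //= 2
    return a

# toy parameters: n = 8, q = 17, gamma = 3 with 3^8 = -1 mod 17
coeffs = [1, 2, 3, 4, 5, 6, 7, 8]
a_hat = ntt_ct(coeffs, 17, 3)
# the full transform evaluates a at the odd powers gamma^(2*bitrev(j,3)+1)
for j in range(8):
    root = pow(3, 2 * bitrev(j, 3) + 1, 17)
    assert a_hat[j] == sum(x * pow(root, t, 17)
                           for t, x in enumerate(coeffs)) % 17
```

With early_abort set, the remaining degree-one residue pairs are handled by the basecase multiplication described in Section 3.3.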

Architecture -NTT
Designing an efficient and flexible NTT with support of all mentioned lattice-based schemes requires new design approaches and multiple components.

NTT/INVNTT Address Unit.
It generates the two read and write addresses to load and store two coefficients as well as the read address for the Twiddle factor according to Algorithms 14 and 15. The signals ntt and invntt trigger the corresponding address computations. Optionally, early_abort and negacyclic can be set. The signal mont is used to select the number of pipeline stages to delay the write signals according to the delay in the arithmetic units.

Point Address Unit.
It computes the addresses for pointwise multiplications, additions, and subtractions. The signal basemul is used to select basecase multiplications for schemes with early abort. Let f, g ∈ Z_q[x]/φ(x) and let NTT(f) ∘ NTT(g) = f̂ ∘ ĝ = ĥ denote the basecase multiplication with n/2 products. These products are computed with

ĥ_{2i} = f̂_{2i}·ĝ_{2i} + f̂_{2i+1}·ĝ_{2i+1}·γ^{2·br(i)+1}  and  ĥ_{2i+1} = f̂_{2i}·ĝ_{2i+1} + f̂_{2i+1}·ĝ_{2i},

where br(i) denotes the bit-reversal of i. To ideally exploit the NTT hardware architecture, we split the basecase computation into four parts according to Algorithm 16. Each multiplication and addition step can be carried out in n/4 cycles (plus pipeline slack), while the address is always incremented by four.
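A functional sketch of one degree-one basecase product (here ζ stands for the per-pair twiddle; the split into four address passes is not modeled):

```python
def basemul_pair(f, g, zeta, q):
    # product of f0 + f1*x and g0 + g1*x modulo (x^2 - zeta):
    # the x^2 term reduces to zeta
    h0 = (f[0] * g[0] + f[1] * g[1] * zeta) % q
    h1 = (f[0] * g[1] + f[1] * g[0]) % q
    return h0, h1

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2 = (3 + 8*2) + 10x mod (x^2 - 2), q = 17
assert basemul_pair((1, 2), (3, 4), 2, 17) == (2, 10)
```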
Wrap Address Unit. This address unit is used for schemes not based on the NTT to reduce the length-2n polynomial product by φ(x) = x^n + 1. For this negative wrapping, the upper half of the product is subtracted from the lower half.
Generic Modular Multiplier. As stated previously, our proposed generic modular multiplier architecture supports Montgomery modular multiplications for primes up to 24 bits. For multiplications with lifted primes for 'NTT unfriendly' schemes, it also supports modular multiplication with the reduction-friendly Solinas prime 2^39 − 2^12 + 1. While designing this dual multiplier, our objective was to ensure that the architecture provides a high operating frequency with pipelining support. Moreover, costly resources like FPGA DSP blocks are shared between the two multiplication modes. The architecture of the proposed dual multiplier is shown in Figure 7. The multiplier takes a and b as input multiplicands. The design also requires the Montgomery modulus M and M′ = −M^{−1} mod R, where R = 2^24 was chosen. The control input mont determines whether a Montgomery multiplication or a multiplication modulo the Solinas prime is executed. In Figure 7, the blue modules are shared between both multiplication modes, the dark gray modules are dedicated to multiplications modulo the Solinas prime, and the light gray modules are dedicated to Montgomery multiplications. As can be seen, the DSP blocks and a few multiplexers are part of the shared resources, whereas the dedicated modules contain mainly adders and subtractors. The adders and subtractors are implemented by efficient usage of fast carry chains [KG16]. The reduction logic for multiplications modulo the Solinas prime exploits the structure of the target prime and involves only two additions and three subtractions. To allow a high operating frequency, the Montgomery and Solinas multipliers include pipeline registers (12 and 6 stages, respectively). For simplicity, the pipeline registers are not shown in Figure 7.
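The Montgomery path computes a·b·R^{−1} mod M. A bit-level model of this dataflow, without the pipeline structure (the example modulus below is ours, chosen as a 23-bit NTT-friendly prime):

```python
R_BITS = 24
R = 1 << R_BITS  # Montgomery radix, matching the 24-bit multiplier

def montmul(a, b, M, M_prime):
    # Montgomery multiplication: returns a*b*R^{-1} mod M for odd M < R
    t = a * b
    m = (t * M_prime) & (R - 1)   # m = t * (-M^{-1}) mod R
    u = (t + m * M) >> R_BITS     # (t + m*M) is exactly divisible by R
    return u - M if u >= M else u

M = 8380417                       # example 23-bit NTT-friendly prime
M_prime = (-pow(M, -1, R)) % R    # M' = -M^{-1} mod R
a, b = 123456, 654321
res = montmul(a * R % M, b * R % M, M, M_prime)  # inputs in Montgomery form
assert montmul(res, 1, M, M_prime) == a * b % M  # leave Montgomery form
```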

Results -NTT
All resource utilization and frequency results of this work are extracted after the place-and-route phase using Xilinx Vivado. The chosen platform of this work is the NewAE Technology Target Board CW305, equipped with an Artix 7 FPGA XC7A100T. Table 2 compares flexible NTT designs of previous works with our design. To the best of our knowledge, none of the previous works provides a similar level of flexibility. Our design supports the following features: 1) configurable at runtime; 2) the highest parameter range, covering all mentioned lattice-based algorithms (n up to 4096, q up to 39 bits); 3) positive and negative convolutions; 4) early abort; 5) pointwise multiplications, additions, and subtractions.
The number of clock cycles of our NTT architecture is 2n · log(n), plus 14 or 8 cycles of latency depending on whether Montgomery or Solinas prime reductions are performed. Previous works, such as [FS19, FSS20], take advantage of the small coefficient size of some schemes and pack two coefficients into one memory line. As we also want to support large coefficient widths, we decided not to store two coefficients in one word and not to compute two parallel butterfly computations. The cycle count can be further reduced by using multiple data RAM blocks (e.g., 8 in [BUC19]) to reduce the memory access bottleneck. This allows loading and processing multiple coefficients in parallel. As shown in [MKÖ + 20], this can significantly reduce the cycle count. However, using multiple RAM blocks and butterfly units is extremely expensive in terms of area and also increases the design complexity.
Due to the power-of-two modulus of Saber, using our generic NTT for its polynomial multiplications is not a natural choice. Therefore, we compare our design with alternative multiplier strategies in Table 3. [MTK + 20] presented a Saber co-processor for multiplications based on the Toom-Cook algorithm. Although their design has a similar LUT and FF consumption, it uses significantly more DSP slices. Moreover, our architecture is more flexible and supports different parameter sets. In [RB20], a high-speed schoolbook multiplier with reductions by the prime 7681 is implemented that makes use of 256 multiply-accumulate units. Although it is much faster, it also comes with a huge resource overhead and less flexibility compared to our approach. In [BR20], the approaches of [RB20] were further extended and optimized to require fewer resources, while omitting the prime reduction capability. [ZZY + 20] presented a high-speed multiplier based on the Karatsuba algorithm that, again, is faster but at a much higher resource cost.
For comparison, metrics like latency × area can be used to rank the efficiency of designs, as multiple metrics are condensed into a single value. In the case of Table 3, however, we decided to omit such a comparison. Converting the DSPs into, e.g., LUTs can be misleading, as only a fraction of a DSP's functionality is actually used for multiplication. Besides that, the designs in [RB20, BR20, ZZY + 20] were implemented on Xilinx UltraScale+ FPGAs, which come with different DSPs (DSP48E2) than the Artix-7 fabric we used (DSP48E1). For example, [BR20] explicitly states that their optimization requires modern DSPs with larger operand widths.
Although there are much faster alternatives for the polynomial multiplication when optimizing for Saber, the high resource cost of these designs makes them unsuitable for our embedded scenario. In a co-design, the extremely low latencies would not have such a strong influence on the overall performance. Our design provides an appropriate balance between resource cost and performance while supporting a wide range of parameters. When optimizing for a specific algorithm, our design would require even fewer resources. For instance, supporting only Saber or Kyber requires fewer BRAM resources, as the polynomial length is small compared to the other schemes in Table 1. Moreover, fewer address units and pipeline registers would be required.

HW Accelerators for Non-Linear Operations
In this section, we describe hardware architectures for the non-linear operations of Kyber and Saber. These operations need to combine information from both shares and therefore require special treatment in a masked design. In contrast to the NTT accelerator, the accelerators proposed in this section are designed for a tight processor coupling.

Masking Keccak
Most lattice-based NIST schemes use the Keccak functions SHA3 and SHAKE to create hash outputs and pseudo-random numbers. Keccak hardware implementations have a particularly good energy efficiency for random number generation because Keccak produces a large number of bits per round [BUC19]. The core operation of the Keccak algorithms is the state permutation function f-1600. One round of this function can be split into the following operations: Theta (θ), Rho (ρ), Pi (π), Chi (χ), and Iota (ι). While Theta, Rho, Pi, and Iota are linear functions consisting of XOR and rotation operations, Chi is a non-linear operation that additionally requires AND as well as NOT operations. The linear functions can be performed on each share individually. Therefore, in the following we discuss only the non-linear function Chi in more detail.

Chi operation (χ). The Keccak state can be represented as a three-dimensional array a[x][y][z] with 0 ≤ x, y < 5 and 0 ≤ z < 64. The Chi operation updates each lane as a[x][y] ← a[x][y] ⊕ (¬a[x+1][y] ∧ a[x+2][y]), with the x index taken modulo 5.
If the operations in Equations 19 and 20 are executed from left to right, the authors in [BDPVA10] argue that all intermediate computations are independent of native variables. Instead of using fresh randomness, different parts of the state are reused to form independent computations.
To accelerate the computations in Equations 19 and 20, we developed the hardware design of Figure 8. The accelerator operates in three steps. In the first step, for each share, five 32-bit lanes of a fixed y coordinate are loaded via a secure address decoder into two separate register files. Depending on the address value, the inputs in1 and in2 are stored in either the registers Reg A1 or Reg A2. In the second step, the Chi operation is computed. For this purpose, Equation 19 is split into two parts; Equation 20 is split in the same way. While the first part contains only computations on a single share, the second part includes both shares. However, the critical shares are already blinded by independent state bits. To avoid leakage due to glitch effects, the computations are separated by registers. Finally, the result of the Chi operation is written to the output and the next 32-bit lanes can be loaded. This procedure requires 2 × 5 repetitions until the whole Chi operation is performed. Loading the complete state into the accelerator would only lead to a small performance improvement, as the actual Chi computation of the proposed accelerator requires only two cycles; however, it would significantly increase the area cost.
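Functionally, the two-share Chi split on a single lane can be modeled as below; the left-to-right evaluation order mirrors the argument of [BDPVA10], although the register barriers that prevent glitches in hardware have no software equivalent (variable names are ours):

```python
import random

MASK = (1 << 32) - 1  # one 32-bit lane slice per register word

def masked_chi_lane(a, b, c):
    # two-share chi on one lane: the shares recombine to a ^ (~b & c);
    # each of a, b, c is a pair of Boolean shares (x0, x1)
    a0, a1 = a
    b0, b1 = b
    c0, c1 = c
    # evaluated left to right, every intermediate stays blinded by a share
    r0 = a0 ^ (~b0 & MASK & c0) ^ (b0 & c1)
    r1 = a1 ^ (~b1 & MASK & c1) ^ (b1 & c0)
    return r0, r1

rng = random.Random(1)
for _ in range(100):
    a0, a1, b0, b1, c0, c1 = (rng.getrandbits(32) for _ in range(6))
    r0, r1 = masked_chi_lane((a0, a1), (b0, b1), (c0, c1))
    assert r0 ^ r1 == (a0 ^ a1) ^ (~(b0 ^ b1) & MASK & (c0 ^ c1))
```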

Masking Binomial Sampling
Many efficient LWE-based schemes require sampling from a centered binomial distribution. A centered binomial sample can be retrieved as s = Σ_{i=1}^{η} (x_i − x′_i), with x_i and x′_i denoting the bits of two uniformly distributed η-bit integers. The authors in [SPOG19] proposed two different masked sampling methods for software implementations. The first method, which is based on [OSPG18], was specially designed for first-order masked implementations. The second method turns the input into a bit-sliced representation and computes the Hamming weights of x and x′ using a secure AND function.
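For reference, the unmasked sampler is simply a difference of Hamming weights (a plain sketch; the exact byte parsing of the Kyber/Saber specifications is omitted):

```python
def cbd_sample(x, x_prime, eta):
    # centered binomial sample in [-eta, eta] from two uniform eta-bit values
    hw = lambda v: bin(v & ((1 << eta) - 1)).count('1')
    return hw(x) - hw(x_prime)

assert cbd_sample(0b11, 0b00, 2) == 2   # HW(3) - HW(0)
assert cbd_sample(0b01, 0b11, 2) == -1  # HW(1) - HW(3)
```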
Bit-slicing accelerator. The bit-slicing method allows computing multiple samples in parallel. Although still more efficient than non-bit-sliced approaches, the conversion of the Keccak output into bit-sliced format turns out to be relatively costly in software if the sampling is performed according to the specification of, e.g., Kyber or Saber [BDK+21]. In hardware, however, turning the Keccak output into bit-sliced format corresponds to a simple rewiring. Figure 9 shows the top-level architecture of the bit-slicing accelerator. The uniformly distributed Keccak squeeze is stored in up to 2·η_max registers within the accelerator, with η_max = 5; the transformation into bit-sliced format inside the accelerator is realized as pure rewiring.
Masked adder tree based on TI. Let the sum and carry computations in the adder tree be split into a linear function f_1 and a non-linear function f_2. While the linear functions can always be computed on a single share, for the non-linear functions at least one share is always missing during the computations (non-completeness property). When converting the proposed adder tree using TI principles and the functions f_1 and f_2, the architecture of Figure 10 (b) is obtained. It is not possible to fulfill the uniformity property of a non-linear Boolean operation that has two inputs and one output [NRS11]. Therefore, the uniformity property for each output of the function f_2 needs to be recovered using fresh randomness. Changing the adder tree to use full adders, where three-input operations avoid the refreshing step, is theoretically possible. However, such an architecture would lose flexibility, as a different circuit would be required for each η. Therefore, another alternative to reduce the randomness requirements is investigated.
Masked adder tree based on DOM. When the uniformity property is preserved, secure TI implementations can be realized with a low amount of randomness. As this is not the case for our adder tree, and the generation of fresh randomness is expensive on most platforms, we investigate the behavior of the adder tree architecture with DOM principles. As shown in Figure 10 (c), the DOM approach significantly reduces the complexity. Instead of three instances of f_1 and 3·(η_max − 1) instances of f_2 in each level, only two and η_max − 1 instances, respectively, are required when using the DOM approach. The computation c_i = (c_{i−1} ∧ z_{i−1}) in f_2-DOM is realized with the secure DOM-AND. For the generation of 32 binomially distributed coefficients, the adder tree based on TI requires 4·η_max·(η_max − 1) random 32-bit values plus 2·η_max values for randomizing the zero-input of z. In contrast, the DOM approach requires 2·η_max·(η_max − 1) plus η_max random values. For instance, with η_max = 5 the amount of randomness reduces from 90 × 32-bit to 45 × 32-bit.
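The first-order DOM-AND underlying f_2-DOM can be sketched functionally as follows: the two cross-domain products are refreshed with one fresh random word r, and in hardware a register stage separates the refresh from the final compression (this model checks only correctness, not the glitch behavior):

```python
import random

def dom_and(a, b, r):
    # first-order DOM AND on Boolean shares a = (a0, a1), b = (b0, b1);
    # r is one fresh random word refreshing the two cross-domain products
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 & b0) ^ ((a0 & b1) ^ r)  # inner-domain + refreshed cross term
    c1 = (a1 & b1) ^ ((a1 & b0) ^ r)
    return c0, c1

rng = random.Random(7)
for _ in range(100):
    a0, a1, b0, b1, r = (rng.getrandbits(32) for _ in range(5))
    c0, c1 = dom_and((a0, a1), (b0, b1), r)
    assert c0 ^ c1 == (a0 ^ a1) & (b0 ^ b1)  # shares recombine to the AND
```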

Secure Adder
The secure arithmetic addition of Boolean-masked shares is an essential element of the generic B2A and A2B conversions. Two secure adder designs, based on the ripple-carry adder and a pipelined Kogge-Stone adder, were proposed in [SMG15]. The Kogge-Stone adder achieves a lower latency as it belongs to the class of carry-lookahead adders. It splits the carry computation into a generate and a propagate part. Due to its good performance, the Kogge-Stone adder suits our application well, and the proposed architecture was adopted for our design. The TI-based Kogge-Stone adder for three shares, shown in Figure 11, is constructed using three stages for performing 4-bit additions.
The first stage computes the linear function f_1: p_i = x_i ⊕ y_i and the non-linear function f_2: g_i = x_i ∧ y_i. The remaining stages require f_2 and f_3, which update g_{i+j} ← g_{i+j} ⊕ (p_{i+j} ∧ g_i) and p_{i+j} ← p_{i+j} ∧ p_i with j = 2^{stage−1}. While the first stage requires further randomness for recovering the uniformity property after f_2, the remaining stages can use the independent bit values of g_i^{(0)} instead of r_i to keep uniformity.
Table 4 summarizes the resource utilization and performance of our HW accelerators presented in this section, evaluated for the Artix 7 FPGA XC7A100T. Critical signals and components that involve non-linear operations were defined with the Verilog dont_touch attribute, preventing the synthesis tool from optimizing them. Without this attribute, a lower resource utilization and better performance can be expected. Nevertheless, we chose the safe option and accept these drawbacks.
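The unmasked skeleton of these stages is a word-level Kogge-Stone prefix computation; in the masked adder, each AND below becomes a shared f_2/f_3 instance with the randomness handling described above (variable names are ours):

```python
def kogge_stone_add(x, y, width=32):
    # carry-lookahead addition via a generate/propagate prefix computation
    mask = (1 << width) - 1
    g = x & y  # f2: generate,  g_i = x_i AND y_i
    p = x ^ y  # f1: propagate, p_i = x_i XOR y_i
    d = 1
    while d < width:  # log2(width) stages with offset j = 2^(stage-1)
        g = (g ^ (p & (g << d))) & mask  # g_{i+j} ^= p_{i+j} AND g_i
        p = (p & (p << d)) & mask        # f3: p_{i+j} &= p_i
        d <<= 1
    return ((x ^ y) ^ (g << 1)) & mask   # sum bits from the carry word

assert kogge_stone_add(0xFFFFFFFF, 1) == 0  # carry out of bit 31 is dropped
assert kogge_stone_add(0x12345678, 0x0F0F0F0F) == \
    (0x12345678 + 0x0F0F0F0F) & 0xFFFFFFFF
```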

Results -Non-Linear Accelerators
The cycle counts in Table 4 are the latencies within the accelerator. Cycles for loading, storing, and clearing the input/output operands are excluded. From a system perspective, the accelerator latencies are partially hidden by the loading operations. As an example, consider the masked adder trees of the binomial sampling accelerator shown in Figure 10. If the shares associated with x_4 are the last operands to be loaded, all previous stages can already be computed while the operands are loaded. To the best of our knowledge, no HW/SW codesign of Keccak that supports masked and non-masked operations has been published so far. The fully masked HW designs [BDPVA10, GSM17, ABP+18] report only ASIC results in gate equivalents, making it difficult to compare with our FPGA results. The tightly-coupled f-1600 accelerator in [FSS20] only supports non-masked computations. Our f-1600 accelerator supports complete round computations for non-masked operations and incomplete round computations (only Theta, Rho, Pi) for masked operations. The Chi accelerator is used to securely accelerate the non-linear operation of Keccak.
The results show that the DOM variant of the binomial sampling accelerator (Binom Tree) not only decreases the amount of required randomness but also leads to a significant area reduction compared to the TI variant. Therefore, in the remainder of this article, only the DOM variant is considered for further measurements.
Our secure adder is very similar to that of [SMG15]; both are designed for 32-bit operations. Our higher resource consumption can be explained by the dont_touch attributes, an additional secure address decoder, and an additional feature that allows computing 32-bit additions with and without input carry. While Saber requires only 16-bit and Kyber 24-bit additions (in the compression, see Section 2.2), computations with input carry can become important for schemes with larger parameter sets or higher-order masking. Note that if the input carry is enabled, our adder takes one additional clock cycle.

System Integration
RISC-V is an open ISA based on the Reduced Instruction Set Computer (RISC) principles.
Due to its open-source character, RISC-V has achieved wide adoption in academia as well as industry. Several open-source RISC-V processor designs have been proposed in recent years. One of the most popular processors is the 32-bit core CV32E40P (formerly RI5CY) from the Parallel Ultra Low Power (PULP) project. The CV32E40P core, originally developed by ETH Zürich and the University of Bologna, is an in-order execution core with four pipeline stages. It supports the complete base integer instruction set (I) and the extensions for compressed instructions (C) as well as multiplication instructions (M). Optionally, the extension for single-precision floating-point instructions (F) can be used. Additionally, the core features custom ISA extensions such as hardware loops, post-incrementing load and store operations, and bit-manipulation operations that optimize the core for low-power signal processing applications.
Without further optimization for post-quantum applications, the processor's performance is significantly lower compared to the performance of the popular ARM Cortex-M4 [FSS20], which is probably the most used embedded evaluation platform for cryptography in academia. Nevertheless, the core is completely written in SystemVerilog and highly suitable for custom extensions and core modifications. This makes the core well suited for the evaluation of our accelerators developed in this project. Figure 12 shows the architecture of our system. Its main components are a RISC-V processor including our tightly coupled accelerators (Keccak, Chi, Bit-slice, Binom Tree, Secure Adder), a loosely coupled NTT accelerator, an instruction memory, one data memory (optional second data memory), and a set of peripherals (UART, SPI, I2C, GPIO).
In addition to our custom accelerators, the RISC-V core includes the following components: prefetch buffer, instruction decoder, General Purpose Register (GPR), Floating Point Register (FPR), ALU, Control Status Register (CSR), multiplication unit, Load Store Unit (LSU).
As we make use of the optional FPR for the Keccak f-1600 accelerator, the instruction decoder also has to support the corresponding load and store instructions. Apart from that, we do not make use of any floating-point extensions and thus, we did not include the dedicated Floating Point Unit (FPU) or support of any other floating-point instructions.

Architectural Leakage Reduction
Storing two shares in the same register file can lead to exploitable leakage, even if both shares are not accessed simultaneously [SR15]. The reason is that the registers can be connected to the same internal bus and combinatorial circuit. Although this impacts performance, in this work only one share resides in the register file at any point in time. Before the second share is processed, the first share is cleared.
In the non-linear accelerators in the EX stage, the shares are always stored in different register files. Optimizations are disabled via dont_touch attributes. Register values are only accessed via a secure address decoder, which first computes a select signal before accessing one of the register banks. Moreover, addresses are one-hot encoded to avoid problems during address switches.
The pipeline registers between the ID and EX stages are another typical source of leakage when transitioning between operations on different shares. This affects the three operand registers of the ALU, the multiplier unit, and the post-quantum accelerators, respectively. These pipeline registers must be cleared after critical operations. Moreover, the serial divider, which performs division and remainder computations, contains pipeline registers that have to be cleared to avoid leakage.
In an FPGA design, the instruction and data memories are constructed from BRAM resources. The main elements of a BRAM are an input register, a memory array, an output latch, and an optional output register that improves the critical path. Overwriting one of these registers/latches with another share can lead to exploitable leakage [BDGH15]. The routing nets in the memory array have buffers to improve signal quality; charging and discharging these nets can thus lead to amplified leakage. To avoid such effects, a separate second data memory is placed in the design. It can optionally be used to clearly separate the shares for critical operations. Variables can be relocated using the compiler's section attribute.

Accelerator Integration
In the presented architecture, two different accelerator types are used. While the NTT accelerator is loosely coupled to the processor and connected to an AXI bus, the Keccak, Chi, Bit-slice, Binom Tree, and Secure Adder accelerators are tightly coupled and directly integrated into the RISC-V processor.
The authors in [FSS20] and [AEL + 20] have shown that the NTT is also well suited for tightly coupled accelerators. However, these previous works focused on schemes with 16-bit coefficients. In this work, bit sizes up to 39-bit are supported to cover all main lattice-based schemes. The 39-bit NTT operations are not very suitable for ISA extensions because two registers or memory lines would be required in a 32-bit architecture for a single operand. This doubles load/store latencies and complicates instruction encodings. Computing the convolution using the Chinese Remainder Theorem (CRT) turned out to be less efficient [FSS20]. A loosely coupled approach is therefore the preferred solution to clearly separate the 39-bit operations within the NTT and the 32-bit operations of the processor.
The accelerator configuration registers and the NTT memory are memory-mapped. Table 9 of Appendix A summarizes the memory map of the platform. The addresses starting at 0x1B10 8000 include: i) the parameters offset_1, offset_2, offset_3, n⁻¹, q, and q̂, ii) a configuration register containing the polynomial length n and the configuration signals mont, negacyclic, early_abort, ntt, invntt, pointwise, basemul, wrapping, mul_ninv (see Section 3.3).
Similar to [FSS20], the Keccak accelerator for the f-1600 round function is placed in the ID stage because this accelerator requires parallel access to 50 registers, which are in the same processor stage. To be more precise, the temporary registers t0-t6, the saved registers s1-s11, and the floating-point registers f0-f31 are connected in parallel to the f-1600 accelerator. All remaining accelerators require at most three input operands and one output operand and are placed, like the ALU, in the EX stage. Table 9 also provides an overview of the ISA extensions developed in this work. All instructions are mapped to the opcode 0x77. The instructions are all single-cycle except for pq.mbinc, pq.mbincinv, pq.mchic, pq.maddc, and pq.maddcc (see Section 4.4).
The Keccak instruction can be configured to perform complete and incomplete rounds. Register rs1 controls this configuration together with the reset functionality, while register rs2 selects the Keccak round. The remaining accelerators provide write instructions (input in rs1/rs2, address in rd) and read instructions (output and address in rd) to securely copy the shares between the register file and the accelerator. In addition to the compute operation, the Binom Tree accelerator has instructions for resetting z_{0:2} and for copying the sum s_{0:2} to the input z_{0:2}. The instruction pq.mbincinv is used for computing the subtraction. Table 5 states the resource consumption and performance of the whole RISC-V system shown in Figure 12 for three different configurations and provides a comparison to related works.
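To illustrate how such instructions fit into the 32-bit RISC-V R-type format, the following Python sketch encodes a hypothetical write instruction under the custom opcode 0x77. The funct7/funct3 values and register numbers are made up for illustration; only the opcode and the rs1/rs2/rd roles are taken from the text:

```python
# Sketch of a custom R-type encoding under the custom opcode 0x77.
# funct7/funct3 values here are hypothetical, not the actual pq.* encodings.

def encode_rtype(funct7, rs2, rs1, funct3, rd, opcode=0x77):
    assert 0 <= rs1 < 32 and 0 <= rs2 < 32 and 0 <= rd < 32
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) \
         | (funct3 << 12) | (rd << 7) | opcode

# Hypothetical write instruction: shares in rs1/rs2, accelerator address in rd.
word = encode_rtype(funct7=0b0000001, rs2=11, rs1=10, funct3=0b000, rd=5)
assert word & 0x7F == 0x77        # custom opcode in bits 6:0
assert (word >> 7) & 0x1F == 5    # rd field in bits 11:7
```

Keeping all extensions under one opcode leaves the standard RV32 opcode space untouched, with funct7/funct3 distinguishing the individual accelerator operations.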

Results - System Integration
The first configuration is the standalone architecture, consisting only of the basic RISC-V system without any accelerators, FPU, or FPR. It serves as a baseline for comparison with our accelerator extensions.
The second configuration is the accelerated architecture, which includes the loosely coupled NTT and the tightly coupled Keccak accelerator. The FPR is enabled because it is specifically used to store the Keccak state and can thus be considered part of the accelerator; it accounts for roughly 400 LUTs and 1026 FFs (125 Slices). Compared to the standalone version, the LUT consumption increases by a factor of 1.59, and the FF and Slice consumption each increase by a factor of 1.42. As the longest path in the design does not lie within the accelerators, the maximum frequency remains at 62 MHz.
The third configuration additionally enables the masked accelerators, i.e., the Secure Adder, Binom Tree, Chi, and Bit-slice accelerators. In addition, the optional second data memory is instantiated to allow domain separation of the data shares, which accounts for the increase in BRAM usage. Compared to the accelerated version, the LUT/FF/Slice consumption increases by a factor of 1.44/1.45/1.41. Although the accelerators still do not contain the longest path in the design, the maximum frequency slightly decreases. This is most likely caused by reduced routing flexibility due to the increased resource consumption.
When comparing our accelerated version with [FSS20], it can be observed that the number of LUTs is lower and the number of FFs slightly higher. Although our NTT multiplier supports a wider input range, we require fewer DSP slices. This is explained by the manually optimized DSP mapping and by the use of a single multiplier instead of two multipliers in parallel. Due to the loose coupling of the NTT accelerator, the BRAM utilization increased. A direct resource comparison to [AEL + 20] is barely possible, as a completely different RISC-V platform (VexRiscv) was used; however, the resource overhead in [AEL + 20] is expected to be smaller, as only a single Barrett multiplier is added to the original core. Note that the resources presented for this work and [FSS20] include the whole PULPino platform (with UART, SPI, I2C, . . . ), and that the frequency of [AEL + 20] was only evaluated for ASIC (see Table 6).

For the ASIC synthesis, we use the same low-power technology as [FSS20]. This choice trades performance in favor of a low power and energy consumption and is thus well suited for embedded devices. The overhead of our accelerated and masked design behaves similarly as for the FPGA synthesis. Compared to [FSS20], our accelerated design requires a similar number of combinatorial cells and about 28 % more sequential cells. Due to the loosely coupled NTT, the memory size is about 33 % larger than in [FSS20]. While FPGAs offer a large number of BRAMs with dual-port capabilities, memory is usually very costly in ASIC designs. However, one advantage of an ASIC design is its higher flexibility, as even the unusual word length of 39 bits can be directly supported. Our NTT design uses one dual-port RAM (207,178 µm²) for the coefficients and one single-port RAM (115,812 µm²) for the twiddle factors (each of size 4k × 39 bits). If only Kyber and Saber are targeted, smaller memories would be sufficient; for example, a 1k × 39-bit single-port RAM reduces the area to 37,526 µm².
In addition to the memory blocks, a further challenge when converting our FPGA design to an ASIC is the DSP-optimized dual multiplier of the NTT discussed in Section 3.2. The asymmetric structure of the DSP multipliers is also efficiently realizable with the Cadence ChipWare multiplier. To verify this, we compared an asymmetric 26 × 18 with a symmetric 22 × 22 multiplier. The critical path is 8.3 ns for both, and the cell counts differ only slightly: 902 (symmetric) vs. 934 (asymmetric). For this reason, we decided to replace the DSP slices directly with the asymmetric ChipWare multipliers. Finally, some manually FPGA-optimized carry-chain instances were replaced with generic primitives.

Experimental Results
This section provides an overview of the performance results for the optimized non-masked and masked implementations of Kyber and Saber, and the leakage assessment of our routines and accelerators.

Performance Unmasked Implementations
We evaluated the cycle count for Kyber and Saber with different parameter sets (NIST Levels I, III, V). Our source code for the accelerated implementations was compiled with optimization flag -O3. Table 7 summarizes our benchmark results and provides a comparison to related works. For the non-masked accelerated version, only the loosely coupled generic NTT unit and the Keccak f-1600 accelerator are used.
The cycle count comparison between our work and the pure software RISC-V implementations in [FSS20] and [Gre20] shows that the integration of hardware accelerators and ISA extensions can lead to clear improvements. Even the assembly-optimized implementations on the superior ARM Cortex-M4 instruction set cannot compete with our codesign. We achieve cycle count improvement factors of 3.47 for Kyber-768 and 2.63 for Saber (whole algorithm execution).
The proposed ISA extension for finite field operations in [AEL + 20] already achieves a good cycle count reduction. However, the stronger accelerators of our work and the additional integration of a Keccak accelerator yield a further major reduction, e.g., an improvement factor of 7.06 for Kyber-1024 compared to [AEL + 20] (whole algorithm execution). It has to be noted that more powerful accelerators are larger; however, this is justified by the achieved performance gain.
When compared to the RISC-V design in [FSS20], we achieve a performance improvement factor of 1.14 for Kyber-768 and 3.30 for Saber (whole algorithm execution). Due to the genericity of our NTT unit, a clear performance advantage for the non-NTT-based scheme Saber becomes apparent. While the tightly coupled NTT design in [FSS20] can be used for Saber only with a costly CRT decomposition, and is thus not faster than accelerated Karatsuba/Toom-Cook approaches, our work is directly suitable for a variety of lattice schemes without any hardware changes. Although [FSS20] tailored their design to the small coefficient size of Kyber and compute two butterfly operations in parallel, we still achieve slightly better performance. This is mainly due to the flexible and efficient basecase multiplication for incomplete NTTs that is directly integrated into the accelerator. When optimizing for a single NTT-based scheme like Kyber, the tightly coupled approach also has some advantages, including the reduced communication overhead between core and accelerator and the better access to the system memory.
Further cycle count improvements can be achieved with co-processor solutions where the main processor is mostly used for configuration purposes as in Sapphire [BUC19] and VPQC [XHY + 20]. These almost standalone solutions compute large parts of the complete scheme within the accelerator. However, we focus on a solution that uses the RISC-V processor as the main computing element to keep the flexibility high. This facilitates spontaneous algorithmic changes and the integration of SCA countermeasures.
It has to be noted that the matrix-vector multiplications in MLWE/MLWR schemes multiply different ring elements of the matrix with the same vector. To reduce the AXI communication overhead and the NTT computation costs, we keep the transformed vector within the NTT memory. Moreover, we only load the result from the NTT memory once subsequent operations like polynomial additions/subtractions are completed. For Saber, the number of NTT calls could be reduced further by deviating from the specification and test vectors. For example, in Kyber the public matrix A is already assumed to be in the NTT domain after sampling, and ring elements are transferred in the NTT domain.
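The saving from keeping the transformed vector in the NTT memory can be illustrated with a simple count of NTT invocations for a k × k matrix-vector product. The counting model below is our own simplification: it assumes accumulation in the NTT domain and ignores Kyber's trick of sampling A directly in the NTT domain:

```python
# Back-of-the-envelope count of NTT invocations for the k x k matrix-vector
# product A*s in an MLWE/MLWR scheme. The counting model is ours, purely
# illustrative, and not taken from the measured results of this work.

def ntt_calls(k, cache_vector):
    fwd = k * k                             # forward NTT of each A[i][j]
    fwd += k if cache_vector else k * k     # s transformed once vs. per product
    inv = k                                 # one inverse NTT per result row
    return fwd + inv

for k in (2, 3, 4):                         # module ranks of NIST Levels I/III/V
    print(k, ntt_calls(k, cache_vector=False), ntt_calls(k, cache_vector=True))
```

Even in this coarse model, caching the transformed vector removes k² − k forward transforms, which grows with the module rank and comes on top of the avoided AXI transfers.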
Only small deviations in code size are visible when compared to the ISA extensions in [FSS20]. Compared to a baseline implementation on RISC-V, the code size is still significantly smaller, as more complex operations are performed with fewer instructions.

Performance Masked Implementations
This section provides an overview of our results for the masked Kyber and Saber implementations and compares them to prior and concurrent works [OSPG18, BDK + 21, BGR + 21]. The masked RLWE implementation presented in [OSPG18] is based on the NewHope algorithm, which has many similarities to Kyber: both are NTT-based and use a prime modulus, leading to similar masking requirements and approaches. The masked RLWE scheme in [OSPG18] can be categorized as NIST Level V. Although the comparison between ARM Cortex-M4 and the deployed RISC-V platform is difficult, the measurements in Table 8 indicate that our accelerators and masking methods lead to a significantly lower cycle count.
When comparing to the masked Saber implementation in [BDK + 21], we achieve a cycle count improvement by a factor of 3.10 (including randomness generation) when using our proposed accelerators. It is also important to mention that our accelerators are designed for flexibility, and the non-linear algorithms are easier to extend to higher-order masking than in [BDK + 21]. More specialized accelerators might lead to further speed improvements; however, in this work, we focus on controlled executions of non-linear operations in hardware and on high flexibility. A more detailed analysis of the masked decapsulation operation can be found in Appendix B.
We also compare to the masked Kyber implementation of Bos et al. [BGR + 21] targeting the ARM Cortex-M0, which was published concurrently with our work. The M0 is an exceedingly energy-efficient and resource-constrained platform, and a direct comparison with RISC-V is again difficult. In absolute cycle counts, our implementation is a factor of 9.9 faster. In contrast, the authors achieve a smaller overhead factor of 2.21 for masking Kyber, but they also start from an unoptimized plain C implementation as the reference. Their work also includes higher-order measurements, where the overhead factor increases greatly. From an algorithmic viewpoint, an important difference between our work and [BGR + 21] lies in the ciphertext compression and the subsequent equality test. We propose a novel MaskedCompress_q routine followed by a masked equality test, whereas Bos et al. opt to compute a masked DecompressedComparison. One of the motivations for the latter was that no masked compression algorithm existed at the time. As such, it remains interesting future work to consolidate and compare these approaches.
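For reference, the unmasked compression that both approaches must realize is Kyber's Compress_q function from the specification. The following Python sketch implements it together with its decompression and checks the standard error bound; the MaskedCompress_q routine of this work computes the same function on arithmetic shares, which is not shown here:

```python
# Unmasked reference of Kyber's ciphertext compression (per the Kyber spec).
# Compress_q(x, d) = round(2^d / q * x) mod 2^d; q = 3329 is odd, so no ties.

q = 3329

def compress(x, d):
    return ((x << d) + q // 2) // q % (1 << d)

def decompress(y, d):
    return (y * q + (1 << (d - 1))) >> d

# Decompressing a compressed value stays within about q / 2^(d+1) of x mod q.
d = 4
for x in range(q):
    x2 = decompress(compress(x, d), d)
    err = min((x2 - x) % q, (x - x2) % q)       # distance mod q
    assert err <= (q + (1 << (d + 1)) - 1) >> (d + 1)
```

Masking this function is non-trivial precisely because the rounding division by the prime q is non-linear over the shares, which is what the novel masked compression algorithm addresses.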
For our target RISC-V platform and implementation, Kyber proves more costly to mask than Saber. As explained in Section 2.4, this is partly due to the more complicated prime-moduli masking algorithms and the additional masked error sampling of Kyber. However, our efficient masking accelerators compute these algorithms in minimal cycles (Table 4), thereby greatly reducing this algorithmic overhead. These non-linear algorithms can be expected to be significantly slower in a pure software implementation, and accordingly, the masking overhead of schemes with prime moduli is expected to be higher there. Another large contributing factor is the generation of the randomness required for the masking, for which we use a 32-byte seed and expand it with SHAKE-128 using our Keccak accelerator. While Saber can directly use the Keccak squeeze output, Kyber partly requires an additional rejection sampling to obtain uniform randomness modulo q. As a result, e.g., Kyber-768 requires roughly 17.5 times more cycles than Saber to generate the initial randomness.
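The rejection sampling step that makes Kyber's randomness generation more expensive can be sketched as follows in Python. hashlib's shake_128 stands in for the Keccak accelerator, and the 3-byte-to-two-candidates packing follows Kyber's Parse routine; the exact byte layout used by our implementation is not specified here, so treat this as illustrative:

```python
# Sketch of rejection sampling uniform values mod q = 3329 from SHAKE-128
# output (the step that Saber, with power-of-two moduli, can skip entirely).

import hashlib

q = 3329

def sample_uniform_mod_q(seed, count):
    out, offset, block = [], 0, 512
    while len(out) < count:
        # Squeeze the next `block` bytes (re-hashing is fine for a sketch).
        buf = hashlib.shake_128(seed).digest(offset + block)[offset:]
        offset += block
        for i in range(0, len(buf) - 2, 3):
            # Two 12-bit candidates per 3 bytes, as in Kyber's Parse routine.
            d1 = buf[i] | ((buf[i + 1] & 0x0F) << 8)
            d2 = (buf[i + 1] >> 4) | (buf[i + 2] << 4)
            for d in (d1, d2):
                if d < q and len(out) < count:
                    out.append(d)
    return out

coeffs = sample_uniform_mod_q(b"\x00" * 32, 256)
assert all(0 <= c < q for c in coeffs)
```

Each 12-bit candidate survives with probability q/4096 ≈ 0.81, so a variable number of squeeze blocks is consumed, which is one reason the cycle count for Kyber's randomness generation is so much higher.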

Side-Channel Leakage Evaluation
In this section, we perform a side-channel leakage evaluation of all non-linear operations discussed in this article. These operations are critical as they need to process both shares at the same time. We describe the applied leakage evaluation method, namely the Test Vector Leakage Assessment (TVLA), give details about our measurement setup, and finally provide evaluation results for each operation, given a total of 100,000 side-channel measurements each.
Test Vector Leakage Assessment (TVLA). The Test Vector Leakage Assessment (TVLA) [GJJR11, SM15] methodology has been established to statistically evaluate the presence of side-channel leakage without prior knowledge about the investigated implementation. Given two sets of data Q_0 and Q_1, Welch's t-test is used to evaluate if the respective means µ_0 and µ_1 significantly differ from each other. The resulting metric of the TVLA, called the t-value, is calculated as

t = (µ_0 − µ_1) / √(s_0²/n_0 + s_1²/n_1),    (33)

with variances s_0², s_1², and n_0, n_1 denoting the cardinalities of the two sets. A high t-value indicates that the null hypothesis (both sets were drawn from the same distribution) is rejected, which implies that it is possible for an attacker to statistically distinguish both sets. This is taken as an indicator for side-channel leakage. In the literature, a threshold of |t| > 4.5 is usually defined to reject the null hypothesis with a confidence greater than 99.999 %.
In order to perform leakage evaluations, the 'non-specific' or 'fixed-vs.-random' t-test can be applied: the evaluator measures the power consumption of multiple algorithm executions with a Boolean or arithmetic masked fixed input x_fixed = x_0 + x_1 and with a randomly masked input x_rand = x_0 + x_1. Measurements are then split into a set Q_0 with fixed input data and a set Q_1 with random input data. Finally, given these two sets, the t-value according to Equation 33 is calculated for each point in time. A resulting t-value outside the confidence interval (|t| > 4.5) indicates that both sets can be distinguished and that the implementation therefore exhibits side-channel leakage, which can potentially be used to mount an attack. Otherwise, the implementation can be considered to withstand first-order univariate attacks with the evaluated number of measurements.
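The fixed-vs.-random procedure can be sketched in a few lines of Python. The traces here are synthetic stand-ins with an artificial leak injected at one sample point, not our actual measurements:

```python
# Pointwise fixed-vs-random Welch's t-test as used in a TVLA evaluation.
# Synthetic Gaussian traces with an injected mean shift emulate leakage.

import numpy as np

def tvla_t_values(traces_fixed, traces_random):
    # traces_*: (n_traces, n_samples) arrays; returns one t-value per sample.
    m0, m1 = traces_fixed.mean(axis=0), traces_random.mean(axis=0)
    v0, v1 = traces_fixed.var(axis=0, ddof=1), traces_random.var(axis=0, ddof=1)
    n0, n1 = traces_fixed.shape[0], traces_random.shape[0]
    return (m0 - m1) / np.sqrt(v0 / n0 + v1 / n1)

rng = np.random.default_rng(0)
q0 = rng.normal(0.0, 1.0, size=(5000, 100))    # fixed-input set Q_0
q1 = rng.normal(0.0, 1.0, size=(5000, 100))    # random-input set Q_1
q1[:, 40] += 0.2                               # inject leakage at sample 40

t = tvla_t_values(q0, q1)
assert abs(t[40]) > 4.5        # leaky sample point exceeds the threshold
```

A first-order secure implementation should keep all samples inside |t| ≤ 4.5, which is exactly what the RNG-on experiments in this section verify.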

Measurement setup.
We implemented our RISC-V design (cf. Section 5) on a NewAE CW305 target board that features an Artix-7 FPGA (XC7A100T). The RISC-V core clock frequency was set to 10 MHz for all side-channel measurements. The SPI interface of the RISC-V platform is used to load the different test programs into the instruction and data memory. The SPI stimuli were created using the GCC PULPino RISC-V compiler (version 7.1.12017050). For all side-channel and performance measurements, the non-linear routines and the accesses to the HW accelerators were manually optimized in assembly. This allows full control over the execution order of instructions and ensures that the shares are correctly cleared so that only one share at a time is present within the processor pipeline and register files. The input data according to the TVLA methodology is transferred from the measurement PC to the RISC-V platform through the UART interface.
We acquire side-channel measurements through the SMA connector of the CW305 board with a Picoscope 6402D USB oscilloscope at a sampling frequency of 156.25 MHz. These power measurements correspond to the FPGA's internal supply voltage measured over the integrated 100 mΩ shunt resistor, amplified by a 20 dB low-noise amplifier. A dedicated trigger mapped to the RISC-V GPIO port is used to indicate the correct time frame for the measurements. For all TVLA evaluations, a total of 100,000 traces were recorded.
Evaluation Results. To practically validate the first-order SCA resistance of our hardware architectures and the non-linear operations, we applied the TVLA method as described earlier in this section. In order to verify the measurement setup, each leakage test is performed twice: once with the Random Number Generator (RNG) activated and once with it deactivated. The results are shown in Figure 13. It can be clearly seen that the resulting t-values contain high peaks far above the confidence boundary of |t| > 4.5 for the tested operations when the RNG is turned off. This validates the setup and shows that all considered operations leak information in an unmasked setting or with a deactivated RNG.
To cover all accelerators and non-linear operations that require processing two shares at the same time, we performed the following tests: i) masked Keccak SHAKE-128 (includes the f-1600 and Chi accelerators), ii) masked binomial sampling Ψ_4 (includes the Bit-slicing and Binom Tree accelerators), iii) masked B2A (includes the Secure Adder accelerator), iv) masked B2A_q (includes the Secure Adder accelerator), and v) MaskedCompress_q (includes the Secure Adder accelerator). Note that the experiment for the compression (Algorithm 13) includes the A2B conversion. Thus, all non-linear operations discussed in Section 2 for masking Kyber and Saber are covered by our experiments. Except for the masked Keccak, all experiments with non-linear operations were performed with 32 polynomial coefficients, which corresponds to one function call of the bit-sliced binomial sampler. The masked binomial sampling was measured with the Saber parameter η = 4. To cover the less critical linear polynomial arithmetic and our loosely coupled NTT accelerator, we provide TVLA results (Figure 14, Appendix C) for the polynomial multiplication s · u^T using the NTT and Kyber parameters.
The evaluation results with the RNG turned on show that all implementations stay within the confidence boundary of |t| < 4.5. This validates the univariate first-order SCA resistance of the non-linear functions, and therefore of all corresponding accelerators, for the given number of measurement traces. We want to emphasize that these results are valid for our measurement setup. It is still possible that exploitable leakage becomes detectable with an increased number of measurements. In addition, an attacker could use a different setup, e.g., (localized) EM measurements in combination with an increased sampling frequency to spatially increase the SNR. Therefore, additional experiments could be needed if protection against a stronger adversary is required. We leave this evaluation as future work.

Conclusion
Attacks on the implementation of a cryptographic algorithm are a major concern in cryptography, as they allow breaking mathematically secure algorithms using side-channel information. Masking can be a powerful countermeasure against SCA, even if the attacker has access to the physical device. In recent years, there have been first works on masking methods for PQC; however, for most PQC finalists, the design cost of a secure implementation is still missing. In this work, we presented generic hardware accelerators for the linear and non-linear operations of masked lattice-based cryptography, with a particular focus on Saber and Kyber. Although NTT designs have been a research target in recent years, no generic HW solutions had been proposed so far. Our novel NTT architecture supports positive/negative wraparounds, incomplete NTTs, and prime lifts for non-NTT-based schemes, achieving fast polynomial arithmetic for a variety of lattice schemes. Non-linear operations, which involve processing two shares at the same time, were accelerated with tightly coupled design solutions for a controlled and efficient execution. These accelerators cover the Keccak Chi step, binomial sampling with bit-slicing, and secure addition operations. All accelerators were integrated into a RISC-V platform, and ISA extensions were developed to access them. Due to a novel masked ciphertext compression algorithm and the flexibility of our design, schemes with a power-of-two as well as a non-power-of-two modulus, with quite different masking operations, can be supported. As a proof of concept, we propose masked implementations of Kyber and Saber. Our generic architecture supports masking for both schemes with the same hardware accelerators. Future work could identify where these accelerators can be optimized if only one scheme needs to be supported.
For example, dedicated accelerators could take advantage of the power-of-two modulus of Saber to speed up the masked decapsulation. Additionally, most of the implemented algorithms extend readily to higher-order side-channel security; expanding the implementation to use more shares is therefore a clear next step for future research.

Table 10 presents the detailed cycle counts for the decapsulation of Kyber-768 and Saber. We present our masked SW reference for comparison, but note that most of the masking algorithms are ill-suited for a plain SW implementation. For example, B2A/A2B conversions based on SecAdd require many operations because they process single bits, and more efficient plain SW implementations exist. The CPA.Dec operation mainly consists of polynomial arithmetic; therefore, the NTT unit leads to a significant improvement. The CPA.Enc operation benefits even more from the proposed accelerators due to the accelerators for sampling and the B2A/A2B conversions. The remaining operations of the decapsulation strongly benefit from the Keccak f-1600 accelerator.