Masking Kyber: First- and Higher-Order Implementations

Abstract. In the final phase of the post-quantum cryptography standardization effort, the focus has been extended to include the side-channel resistance of the candidates. While some schemes have already been extensively analyzed in this regard, there is no such study yet of the finalist Kyber. In this work, we demonstrate the first completely masked implementation of Kyber which is protected against first- and higher-order attacks. To the best of our knowledge, this results in the first higher-order masked implementation of any post-quantum secure key encapsulation mechanism. This is realized by introducing two new techniques. First, we propose a higher-order algorithm for the one-bit compression operation. This is based on a masked bit-sliced binary search that can be applied to prime moduli. Second, we propose a technique which enables one to compare uncompressed masked polynomials with compressed public polynomials. This avoids the costly masking of the ciphertext compression while being able to be instantiated at arbitrary orders. We show performance results for first-, second- and third-order protected implementations on the Arm Cortex-M0+ and Cortex-M4F. Notably, our implementation of first-order masked Kyber decapsulation requires 3.1 million cycles on the Cortex-M4F. This is a factor 3.5 overhead compared to the unprotected optimized implementation in pqm4. We experimentally show that the first-order implementation of our new modules on the Cortex-M0+ is hardened against attacks using 100 000 traces and mechanically verify the security in a fine-grained leakage model using the verification tool scVerif.


Introduction
Public-key cryptography is based on conjectured-to-be-hard mathematical problems. The most widely used examples are RSA, based on the integer factorization problem, and elliptic curve cryptography, based on the discrete logarithm problem. Both are vulnerable to polynomial-time attacks using a quantum computer [Sho94,PZ03].
To defend against this threat, research is directing its attention to post-quantum cryptography (PQC). To streamline this effort, the US National Institute of Standards and Technology (NIST) "has initiated a process to solicit, evaluate, and standardize one or more quantum-resistant public-key cryptographic algorithms" [Nat] in 2016. In total, 69 complete and proper proposals were submitted for the first evaluation round. In October 2020, 15 candidates were announced to have made it through to the third round. It is expected that towards the end of 2021 the winners, which will become the NIST PQC standard, will be announced.
One of the current finalists is Kyber [BDK + 18, SAB + 20]; this scheme belongs to the lattice-based key encapsulation mechanism (KEM) family. Among the other finalists, Saber [DKRV18,DKR + 20] and NTRU [HPS98,CDH + 20] also fall in this category. Kyber's hardness is based on the module learning-with-errors problem (M-LWE) in module lattices [LS15]. Unlike prime factorization and the discrete logarithm problem, the M-LWE problem is conjectured to be hard to solve even by an adversary who has access to a full-scale quantum computer.
Initially, the main evaluation criteria focused on the mathematical security and algorithmic design of the proposals. As the selection of schemes advances, other important characteristics become relevant requirements: one of these is implementation security. One important and well-known attack family is side-channel attacks (SCA). First introduced by Kocher [Koc96], SCAs exploit meta-information obtained while running an implementation to recover secret-key information. Such information can be obtained using, for example, timing analysis, static and dynamic power analysis, electromagnetic analysis or photoemission analysis.
Not surprisingly, works over the recent years have shown that side-channel attacks also affect post-quantum cryptography [TE15]. Timing attacks were first shown to be applicable to lattice-based cryptography by Silverman and Whyte [SW07]. Since then, it has been demonstrated that an adversary can utilize the non-constant time behavior of Gaussian samplers [BHLY16,EFGT17] as well as a generic cache-attack behavior [BBK + 17]. Power analysis attacks on lattices have been shown to be able to attack even masked implementations of lattice-based cryptography by targeting the number theoretic transform [PPM17,PP19,XPRO20], message encoding [RBRC20,ACLZ20], polynomial multiplication [HCY19], error correcting codes [DTVV19], decoders [SRSW20] or CCA-transform [GJN20,RRCB20].
To mitigate the threat of side-channel attacks various types of countermeasures can be applied. This research area has grown over the past few decades for classical cryptography. Techniques to offer side-channel attack resistance for both symmetric and asymmetric primitives are readily available. Applying countermeasures to cryptographic algorithms against side-channel attacks has an impact on the run-time of those algorithms. This impact is even more significant when protecting against higher-order attacks: the situation where an attacker attempts to combine multiple points to overcome the protection mechanisms. NIST specifically asked the scientific community to assist in the evaluation of the final round-3 submissions from a side-channel perspective [AASA + 20].
concepts of [RRVV15], but presents a new decoding algorithm without tables and, in addition, proposes to mask all other secret-dependent modules. As in the KEM case, masked signature schemes have been proposed in [BBE+18, MGTF19, GR19].
The modular nature of KEMs makes it easy to focus on one aspect only, which can then be re-used in multiple other schemes. To utilize this flexibility, [SPOG19] and [BPO+20] propose efficient higher-order masked implementations of a binomial sampler and of a polynomial comparison as used in many schemes. Note, however, that in a recent paper by Bhasin, D'Anvers, Heinz, Pöppelmann and Van Beirendonck an attack was presented on masked R-LWE implementations [BDH+21]. The authors show that the first-order masked comparison of [OSPG18] and the higher-order version of [BPO+20] are vulnerable to side-channel attacks.
A challenge when protecting against side-channel attacks is the fact that many popular schemes, such as Kyber, use a prime modulus. As observed in both [MGTF19] and [GR19], this results in a significant performance overhead compared to power-of-two moduli, which allow more efficient bit-operations and conversions. Due to the usage of such prime moduli in PQC schemes, many prior algorithms needed to be adapted to fit this specific use-case. One of the other NIST finalists, Saber [DKR+20], does use a power-of-two modulus for its operations, and it has been shown how to turn this into an efficient first-order protected scheme by Beirendonck, D'Anvers, Karmakar, Balasch and Verbauwhede in [BDK+20]. An attack on this masked Saber implementation was subsequently presented by Ngo, Dubrova, Guo and Johansson [NDGJ21], who apply deep-learning power analysis in combination with a lattice-reduction step to recover the long-term secret key. Note that this attack does not invalidate the first-order masking scheme of [BDK+20], but rather efficiently exploits higher-order leakages. Generic countermeasures are therefore to increase the masking order or the noise level of the implementation.

Contributions.
To the best of our knowledge, a complete analysis on how to mask Kyber has not been conducted. Due to its similarity to other schemes, in particular NewHope, many masked modules can be re-used from previous works. The masking of the polynomial arithmetic with arithmetic masking, using a prime modulus q, and the masking of the symmetric components can be straightforwardly reused from prior implementations. However, there are some Kyber-specific functions for which no concrete masking scheme has been proposed yet, or previous solutions are limited or sub-optimal.
In this work, we present the first analysis of how to realize a completely masked Kyber. Notably, we show how to construct both first- and higher-order masking schemes for Kyber, with formal proofs in the probing model for the newly proposed masked components. In addition, we present an implementation of our masked Kyber algorithms on a Cortex-M0+ with hardening and experimental validation of the security order for the first-order secure variants. The security of our first-order implementation is mechanically verified using the verification tool scVerif and the refined Stateful strong Non-Interference security notions [BGG+21], capturing concrete execution and device-specific leakage behavior. We also present a Cortex-M4F implementation using the pqm4 [KRSS19] framework, including several assembly-optimized routines, and compare performance numbers to the unprotected implementation of pqm4.
To achieve a complete first- and higher-order masking of Kyber, we propose new masked algorithms for the following two modules.
• Masked One-Bit Compression. Kyber requires compressing an arithmetically masked polynomial to a Boolean-masked bit-string. Prior solutions are either limited to first-order masking (cf. [OSPG18]) or to compression with a power-of-two modulus (cf. [BDK+20]). We propose a new approach based on a bit-sliced binary search, which overcomes both limitations.
• Masked Decompressed Comparison. Kyber uses ciphertext compression. While this can be efficiently masked for power-of-two moduli (cf. [BDK + 20]), it introduces a non-negligible overhead for prime moduli. We introduce a new approach which compares uncompressed masked polynomials with compressed public polynomials. This enables us to avoid having to explicitly mask the ciphertext compression.
Previous works include discussions on how one could extend their techniques to higher orders, but no further details are provided. To the best of our knowledge, this paper presents the first higher-order masked implementation of any PQC KEM. Our target platforms are two of the most popular Arm Internet-of-Things processors offering a 32-bit instruction set. The first is the most energy-efficient Arm processor available for constrained embedded applications: the Cortex-M0+. Our first-order implementation together with the hardened new modules results in a slowdown of a factor 2.2 compared to an unmasked version of Kyber on this target platform. Furthermore, we present performance figures for second- and third-order masked implementations as well. The first-order unhardened implementation with optimized assembly routines for polynomial arithmetic on the second target platform, the Cortex-M4F, leads to a slowdown of a factor 3.5 compared to the optimized pqm4 version of Kyber.
We conduct an experimental verification of the protection order by measuring the power consumption of the two proposed modules and assessing their leakage with the Test Vector Leakage Assessment (TVLA) methodology [GGJR+11] using 100 000 measurements. The first-order hardened implementation of our two new modules does not show any detectable leakage.
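At its core, TVLA compares a "fixed-input" trace set against a "random-input" trace set with Welch's t-test and flags a leak when the statistic exceeds the customary threshold of 4.5. The following minimal Python sketch illustrates the statistic on simulated Gaussian "traces" (our own toy data, not measurements from the paper's setup):

```python
import math
import random

def welch_t(x, y):
    """Welch's t-statistic between two sets of single-point leakage samples."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((s - m1) ** 2 for s in x) / (n1 - 1)
    v2 = sum((s - m2) ** 2 for s in y) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

random.seed(1)
# Two groups from the same distribution: |t| stays below the usual 4.5 threshold.
fixed = [random.gauss(0.0, 1.0) for _ in range(5000)]
rand1 = [random.gauss(0.0, 1.0) for _ in range(5000)]
# A small mean shift models a first-order leak: |t| far exceeds 4.5.
rand2 = [random.gauss(0.3, 1.0) for _ in range(5000)]
print(abs(welch_t(fixed, rand1)))
print(abs(welch_t(fixed, rand2)))
```

In practice the test is run per sample point over full traces; the scalar version above captures the pass/fail decision rule.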

Background
In this section, we first introduce the PQC KEM Kyber. In the description we focus on the functions that process secret-key-dependent material and therefore will need to be masked. For the full description of Kyber, we refer to [SAB + 20].
After a high-level description of in-scope Kyber concepts, we detail our SCA notation and concepts. This serves as a basis for the SCA security analysis of Kyber in Section 3.

Notation.
We denote the ring of integers modulo $q$ by $\mathbb{Z}_q = \mathbb{Z}/q\mathbb{Z}$. Centered modular reduction is denoted by $r' = r \bmod^{\pm} q$, where $-\frac{q-1}{2} < r' \le \frac{q-1}{2}$, and other reductions by $r' = r \bmod q$, where $0 \le r' < q$. The rings $\mathbb{Z}[X]/(X^n + 1)$ and $\mathbb{Z}_q[X]/(X^n + 1)$ are denoted by $R$ and $R_q$, respectively. Further, note that rounding to the closest integer is denoted by $\lceil \cdot \rfloor$ (with ties rounded up) and rounding up is denoted by $\lceil \cdot \rceil$.
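The centered reduction $\bmod^{\pm} q$ can be sketched in a few lines of Python for odd $q$ (illustrative helper, not part of any reference implementation):

```python
def mod_pm(r, q):
    """Centered reduction r mod± q: result in (-(q-1)/2, (q-1)/2] for odd q."""
    r = r % q
    return r - q if r > (q - 1) // 2 else r

q = 3329  # Kyber's modulus
print(mod_pm(3328, q))  # -1
print(mod_pm(1664, q))  # 1664, the upper end of the centered range
print(mod_pm(1665, q))  # -1664
```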
We denote vectors and matrices by boldfaced variables $\mathbf{b}$ and $\mathbf{A}$. As an ingredient of Kyber we need to define the centered binomial distribution $\mathrm{CBD}_\eta$ for a positive integer $\eta$. Sampling from this distribution is achieved by sampling the $2\eta$ elements of $\{(a_i, b_i)\}_{i=0}^{\eta-1}$ uniformly from $\{0, 1\}$ and outputting $\sum_{i=0}^{\eta-1} (a_i - b_i)$.

Compression, Decompression and Sampling.
As a first building block of Kyber, we define a function $\mathsf{Compress}_q(x, d)$ that takes an element $x \in \mathbb{Z}_q$ and outputs an integer in $\{0, \ldots, 2^d - 1\}$, where $d < \lceil \log_2(q) \rceil$. We furthermore define a function $\mathsf{Decompress}_q$, such that $x' = \mathsf{Decompress}_q(\mathsf{Compress}_q(x, d), d)$ is an element close to $x$; more specifically, $|x' - x \bmod^{\pm} q| \le B_q := \lceil q/2^{d+1} \rfloor$. Functions satisfying these requirements are defined as

$$\mathsf{Compress}_q(x, d) = \lceil (2^d/q) \cdot x \rfloor \bmod 2^d, \qquad \mathsf{Decompress}_q(x, d) = \lceil (q/2^d) \cdot x \rfloor.$$

When $\mathsf{Compress}_q$ or $\mathsf{Decompress}_q$ is used with $x \in R_q$ or $x \in R_q^k$, the procedure is applied to each coefficient individually.
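The compression round-trip bound can be checked exhaustively over $\mathbb{Z}_q$. The sketch below follows the definitions above (the choice $d = 10$ matches Kyber768's $d_u$; the helper names are our own):

```python
from math import floor

Q = 3329  # Kyber's prime modulus

def rnd(z):
    """Round to nearest integer, ties rounded up (the paper's rounding)."""
    return floor(z + 0.5)

def compress(x, d, q=Q):
    return rnd((2 ** d / q) * x) % (2 ** d)

def decompress(y, d, q=Q):
    return rnd((q / 2 ** d) * y)

def centered(r, q=Q):
    """r mod± q, the centered representative."""
    r %= q
    return r - q if r > (q - 1) // 2 else r

d = 10                     # d_u = 10 in the Kyber768 parameter set
B = rnd(Q / 2 ** (d + 1))  # the bound B_q; equals 2 for d = 10
assert all(abs(centered(decompress(compress(x, d), d) - x)) <= B for x in range(Q))
print("decompression error bounded by", B)  # prints: decompression error bounded by 2
```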
As a second ingredient there is the sampling function CBD which converts uniformly random bytes into polynomials whose coefficients are distributed as CBD η . This algorithm is summarized in Algorithm 4 in the Appendix.
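A sketch of the per-coefficient CBD sampling (our own illustrative helper; the real sampler of Algorithm 4 consumes SHAKE output rather than a PRNG):

```python
import random

def cbd(eta, bits):
    """One CBD_eta coefficient from 2*eta uniform bits: sum(a_i) - sum(b_i)."""
    return sum(bits[:eta]) - sum(bits[eta:2 * eta])

random.seed(0)
eta = 2  # Kyber's eta_2
samples = [cbd(eta, [random.getrandbits(1) for _ in range(2 * eta)])
           for _ in range(100_000)]
# CBD_2 is supported on {-2,...,2} with probabilities 1/16, 4/16, 6/16, 4/16, 1/16.
assert set(samples) == {-2, -1, 0, 1, 2}
print(abs(sum(samples)) / len(samples))  # empirical mean, close to 0
```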

Kyber PKE.
Kyber is a module-LWE scheme [BGV12, LS15]. Given a parameter $k$, its hardness relies on the hardness of distinguishing samples $(\mathbf{a}_i, b_i) \in R_q^k \times R_q$ in which all elements are drawn uniformly at random from those in which $\mathbf{a}_i$ is drawn uniformly at random but $b_i = \mathbf{a}_i^T \mathbf{s} + e_i$ for a secret $\mathbf{s} \in \beta_\eta^k$, where $e_i \in \beta_\eta$ is refreshed for each sample and $\beta_\eta$ denotes the distribution of polynomials whose coefficients are drawn from $\mathrm{CBD}_\eta$.
The IND-CPA-secure Kyber public-key encryption (PKE) scheme consists of three algorithms: key generation, encryption (Algorithm 5) and decryption (Algorithm 6). Kyber.CPAPKE is parameterized by $n$, $k$, $q$, $\eta_1$, $\eta_2$, $d_u$ and $d_v$. The recommended parameter sets are listed in Table 3, where $\delta$ is the failure probability of the decryption. Since the key generation only processes the secret key once, and masking is commonly aimed at mitigating multi-trace attacks, we omit its description here.

Kyber KEM.
An IND-CCA2-secure KEM Kyber.CCAKEM can be constructed from the Kyber.CPAPKE scheme by applying a version of the Fujisaki-Okamoto transform [FO99, HHK17]. The resulting scheme consists of key generation, encapsulation and decapsulation algorithms. The high-level decapsulation Kyber.CCAKEM.Dec is of main interest in this work since it is the only part affected by our masking techniques; its description is given in Algorithm 7. Again, we refer to [SAB+20] for details on G, H and KDF.

Side-Channel Notation and Notions.
The core concept of masking is to split the sensitive variables into multiple shares and to transform the underlying circuit to process these shared variables securely. To formally argue about the security provided by such shared implementations, Ishai, Sahai and Wagner introduced the t-probing model in [ISW03], which models an adversary that can probe up to $t$ intermediate variables. If every possible $t$-tuple of a given masked circuit is independent of the secret, it is considered secure against $t$-order SCA attacks.
In the following, a sensitive variable $x$ is split into $n_s$ secret shares and the resulting $n_s$-tuple is denoted as $x^{(\cdot)}$. Where applicable, we denote an arithmetic encoding of a variable $x \in \mathbb{Z}_q$ as $x^{(\cdot)}_A$, consisting of $n_s$ arithmetic shares $x^{(i)}_A \in \mathbb{Z}_q$, $0 \le i < n_s$, such that $x = \sum_{i=0}^{n_s-1} x^{(i)}_A \bmod q$. Where applicable, we denote a Boolean encoding of a variable $x \in \mathbb{Z}_2^k$ as $x^{(\cdot)}_B$, consisting of $n_s$ Boolean shares $x^{(i)}_B \in \mathbb{Z}_2^k$ such that $x = \bigoplus_{i=0}^{n_s-1} x^{(i)}_B$. Given a polynomial $f \in R_q$, the $i$-th coefficient of $f$ is denoted as $f_i$. Given a bitstring $b \in \mathbb{Z}_2^k$, the $i$-th bit of $b$ is denoted as $b_i$.

While proving probing security alone is sufficient for single functions (gadgets in the following), it does not easily allow arguing about compositions of multiple gadgets at higher orders (i.e., $t > 1$). Therefore, it is common to rely on the concepts of $t$-(Strong)-Non-Interference ($t$-(S)NI) as introduced in [BBD+16] to argue about the security of such constructions. We recall the $t$-NI and $t$-SNI security notions as presented in [BCZ18]. We consider a gadget taking as input one (or multiple) $n_s$-tuple $x^{(\cdot)}$ of shares, and outputting one (or multiple) $n_s$-tuple $y^{(\cdot)}$. Given a subset $I \subset [0, n_s - 1]$, we denote by $x^{(I)}$ all elements $x^{(i)}$ with $i \in I$.

Definition 1 ($t$-NI security (from [BBD+15, BBD+16])). Let $G$ be a gadget taking as input $x^{(\cdot)}$ and outputting $y^{(\cdot)}$. The gadget $G$ is $t$-NI secure if for any set of $t_G \le t$ intermediate variables, there exists a subset $I \subset [0, n_s - 1]$ of input indices with $|I| \le t_G$, such that the $t_G$ intermediate variables can be perfectly simulated from $x^{(I)}$.
Definition 2 ($t$-SNI security (from [BBD+16])). Let $G$ be a gadget taking as input $x^{(\cdot)}$ and outputting $y^{(\cdot)}$. The gadget $G$ is $t$-SNI secure if for any set of $t_G \le t$ intermediate variables and any subset $O \subset [0, n_s - 1]$ of output indices such that $t_G + |O| \le t$, there exists a subset $I \subset [0, n_s - 1]$ of input indices with $|I| \le t_G$, such that the $t_G$ intermediate variables and the output variables $y^{(O)}$ can be perfectly simulated from $x^{(I)}$.
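The arithmetic and Boolean encodings defined above can be sketched directly (illustrative helpers with names of our choosing; real implementations draw randomness from a suitable RNG):

```python
import random
from functools import reduce

Q, NS = 3329, 3  # n_s = 3 shares target security order t = 2

def arith_share(x, ns=NS, q=Q):
    """Split x into ns arithmetic shares modulo q."""
    s = [random.randrange(q) for _ in range(ns - 1)]
    return s + [(x - sum(s)) % q]

def arith_unshare(s, q=Q):
    return sum(s) % q

def bool_share(x, ns=NS, k=12):
    """Split a k-bit value x into ns Boolean (XOR) shares."""
    s = [random.getrandbits(k) for _ in range(ns - 1)]
    return s + [reduce(lambda a, b: a ^ b, s, x)]

def bool_unshare(s):
    return reduce(lambda a, b: a ^ b, s)

random.seed(7)
x = 1234
assert arith_unshare(arith_share(x)) == x
assert bool_unshare(bool_share(x)) == x
# Linear operations act share-wise: adding a public constant to one share.
sh = arith_share(x)
sh[0] = (sh[0] + 100) % Q
assert arith_unshare(sh) == (x + 100) % Q
print("arithmetic and Boolean sharings recombine correctly")
```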
In Section 3, we prove our new algorithms to be $t$-SNI with $n_s = t + 1$ to provide resistance against $t$-order attacks and to allow formal arguments about their composition.

Figure 1: An overview of the various components in the Kyber CCA decapsulation. The components which need to be protected / masked are in color (gray or green); we present new approaches for the green components.

Masking Kyber at Arbitrary Order
The post-quantum secure key encapsulation mechanism Kyber has a structure similar to other submissions to the NIST PQC standardization effort such as NewHope and Saber.
In particular, it first applies a CPA decryption to the ciphertext in order to recover a message m. This message is then re-encrypted with the CPA encryption and the resulting ciphertext is compared with the original input. Depending on the Boolean result of this comparison, a session key K is derived either from the message (if the re-computed ciphertext and the original input are the same) or from a secret fixed value z (otherwise). A graphical overview of the various modules in the Kyber decapsulation is given in Figure 1; the colored components are those that need to be masked. The decapsulation is deterministic and therefore all modules which process sensitive data derived from the long-term secret s need to be protected against SCA. In this section, we first focus on the two green modules, Compress_q and DecompressedComparison, and present two new approaches for them: masked one-bit compression (Section 3.1) and masked decompressed comparison (Section 3.2). For each, we first provide the basic intuition about its functionality and then prove its t-SNI security in the probing model. We then put the components together to achieve a fully masked Kyber in Section 3.3.

Higher-Order One-Bit Compression
For Saber, where the modulus is a power of two, the compression operation is a shift of the sensitive value: this can be efficiently masked using look-up tables as demonstrated in [BDK+20]. For schemes that use a prime modulus, masking this step is more involved. The authors of [OSPG18] propose a first-order masked solution based on two mask conversions: one arithmetic-to-arithmetic (A2A) and one arithmetic-to-Boolean (A2B) conversion per polynomial coefficient. To improve efficiency and to allow extensions to higher orders, we propose a new approach which works for any modulus, at any order, and requires only one conversion per coefficient.
Informally, the compression to one bit in Kyber splits the domain of each polynomial coefficient into two disjoint intervals and assigns a bit value depending on the interval in which the value of the coefficient lies. In Kyber, this is done with the function $\mathsf{Compress}_q(x, 1) = \lceil (2/q) \cdot x \rfloor \bmod 2$. The compression to one bit results in the following mapping:

$$\mathsf{Compress}_q(x, 1) = \begin{cases} 1 & \text{if } x \in [\lceil q/4 \rfloor, \lfloor 3q/4 \rfloor], \\ 0 & \text{otherwise.} \end{cases}$$

This computation is trivial without masks, but poses a challenge when $q$ is prime and masking is required. When the modulus is a power of two, less-than comparisons can be computed using a B2A conversion [OSPG18]. However, for prime moduli the value space is not equally divided by specific bits. In this case masking $\mathsf{Compress}_q$ requires either the use of pre-computed tables or a dedicated masked-compression algorithm.

Algorithm 1 (Masked version of $\mathsf{Compress}_q(x, 1) = \mathsf{Compress}^s_q(x + \lceil q/4 \rfloor \bmod q)$ as used in Kyber, for any order, using one A2B conversion per coefficient). Input: an arithmetic sharing $a^{(\cdot)}_A$ of a polynomial $a \in \mathbb{Z}_q[X]$. Output: a Boolean sharing $m^{(\cdot)}_B$ of the message $m = \mathsf{Compress}_q(a, 1) \in \mathbb{Z}_2^{256}$.
Let us first recall the first-order approach from [OSPG18]. Given a masked coefficient $a^{(\cdot)}_A$, they first apply an A2A conversion to produce a masked coefficient $b^{(\cdot)}_A$ with a power-of-two modulus $2^k > q$ such that $\sum_i b^{(i)}_A \bmod 2^k = a$. Next, an appropriate offset is subtracted from $b^{(\cdot)}_A$ such that the MSB of the sensitive variable denotes the value to which the coefficient should be compressed. This shared bit can be extracted from the Boolean shares after applying an A2B conversion. Hence, this technique requires one A2A and one A2B conversion per coefficient. Given that these conversions are usually quite expensive, this introduces a significant overhead. Furthermore, to the best of our knowledge, there are no known results for higher-order A2A conversion for arbitrary moduli. The only other published solution in this direction is presented in [BDK+20] and applies only to power-of-two moduli.
We present a solution which can be applied in first- and higher-order settings, works in the setting where a prime modulus is used, and is faster than the state of the art as it omits one A2A conversion per coefficient. In the remainder we introduce the method with a focus on the Kyber application; however, it should be noted that this approach works for any modulus $q$. We start by adding the offset $\lceil q/4 \rfloor = 832$ modulo $q$ to the arithmetic shares, with a subsequent A2B conversion to create $k$-bit Boolean shares of the coefficient, where $k = \lceil \log_2(q) \rceil = 12$. Given these Boolean shares, it then suffices to securely compute whether the masked value is smaller than $q/2$. Let us denote this shifted function as $\mathsf{Compress}^s_q(x)$ such that $\mathsf{Compress}_q(x, 1) = \mathsf{Compress}^s_q(x + \lceil q/4 \rfloor \bmod q)$, where

$$\mathsf{Compress}^s_q(x) = \begin{cases} 0 & \text{if } x \in [0, \lfloor q/2 \rfloor), \\ 1 & \text{if } x \in [\lfloor q/2 \rfloor, q). \end{cases}$$

To compute $\mathsf{Compress}^s_q$ in a masked fashion, we perform a masked binary search on the Boolean-shared bits of the coefficient, starting from the MSB. For example, if the MSB is set to 1, we can ignore the values of all subsequent bits and compress the coefficient to 1, as $2^{k-1} = 2^{11} > q/2$. If the MSB is set to 0, the remaining bits need to be taken into account. This process is repeated until all possible coefficient values have been mapped to a single bit value. For the case of Kyber, $\lfloor q/2 \rfloor = 1664$ and bits 11 to 7 are taken into account. In this case $\mathsf{Compress}^s_q$ is computed as

$$\mathsf{Compress}^s_q(x) = x_{11} \oplus \left( x_{10} \cdot x_9 \cdot \neg(\neg x_8 \cdot \neg x_7) \right). \quad (1)$$

In a masked implementation, the $\oplus$ and $\cdot$ operations should be replaced with calls to their secure counterparts (SecXOR and SecAND). Moreover, to improve efficiency, we can first transform the Boolean shares of the polynomial to a bitsliced representation and compute the compress function for all coefficients in parallel (limited by the word size of the target platform). The complete masked algorithm for $\mathsf{Compress}_q(x, 1)$ is given in Algorithm 1. Note that the algorithm is independent of the specific masked algorithms used for the modules A2B, Bitslice, SecAND, SecREF, and SecXOR.
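As a sanity check, the bit-level binary search over bits 11 to 7 can be compared exhaustively against the arithmetic threshold test. The specific bit equation below is our reconstruction (an assumption) built from XOR, AND, and negation only, matching the threshold $\lfloor q/2 \rfloor = 1664$ stated in the text:

```python
Q = 3329

def compress_s_bits(x):
    """Binary search on bits 11..7: a reconstructed bit equation for
    Compress^s_q (an assumption, not copied from the paper)."""
    b = [(x >> i) & 1 for i in range(12)]
    or87 = 1 ^ ((1 ^ b[8]) & (1 ^ b[7]))  # b8 OR b7 via negations and one AND
    return b[11] ^ (b[10] & b[9] & or87)  # XOR is safe here: terms are disjoint

# Agreement with the arithmetic definition: output 1 iff x >= floor(q/2) = 1664.
assert all(compress_s_bits(x) == (1 if x >= Q // 2 else 0) for x in range(Q))
print("bit equation matches the threshold test for all x in Z_q")
```

The XOR in place of OR at the top level works because, for $x < q$, the MSB and the conjunction of bits 10 and 9 can never both contribute a 1.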
Instead, we provide a short description of the computed functionality and the assumed security property for the proof. A2B denotes a t-SNI secure conversion of arithmetic shares with a prime modulus to Boolean shares encoding the same value. In our higher-order implementations, we use Algorithm 3 from [SPOG19]. Bitslice maps a Boolean-masked polynomial to its Boolean-masked bitsliced representation. This is a linear function and can, therefore, be computed on each share separately. The most efficient way to accomplish this strongly depends on the capabilities of the target platform. In our implementation, we realized it as a sequence of bitshift, bitwise OR, and bitwise AND to rearrange the bits share by share.
With SecAND and SecREF, we describe t-SNI algorithms to compute the masked bitwise AND and to refresh Boolean shares. There are multiple proposals that fulfill this property, e.g., [GJRS18, BCPZ16], and we use Algorithm 18 (resp. Algorithm 20) from [SPOG19] for SecAND (resp. SecREF) in our implementations. SecXOR refers to the t-NI computation of the bitwise XOR of Boolean shares. This is usually achieved by computing the XOR of the input shares separately, e.g., [CGV14, Algorithm 3, Line 3]. Furthermore, we use ¬ to indicate the negation of only the first share of the Boolean-masked input.
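To illustrate the gadget interfaces, the sketch below implements a generic ISW-style SecAND and a share-wise SecXOR on Boolean sharings of words. This is an illustrative stand-in, not Algorithm 18 of [SPOG19] that our implementations actually use:

```python
import random
from functools import reduce

def share(v, ns, w=32):
    s = [random.getrandbits(w) for _ in range(ns - 1)]
    return s + [reduce(lambda a, b: a ^ b, s, v)]

def unshare(s):
    return reduce(lambda a, b: a ^ b, s)

def sec_xor(x, y):
    """SecXOR (t-NI): share-wise XOR, no fresh randomness needed."""
    return [xi ^ yi for xi, yi in zip(x, y)]

def sec_and(x, y, w=32):
    """ISW-style SecAND on Boolean sharings of w-bit words (illustrative)."""
    ns = len(x)
    r = [[0] * ns for _ in range(ns)]
    for i in range(ns):
        for j in range(i + 1, ns):
            r[i][j] = random.getrandbits(w)
            # the correction term keeps the sharing correct: r[i][j] ^ r[j][i]
            # equals x_i&y_j ^ x_j&y_i
            r[j][i] = (r[i][j] ^ (x[i] & y[j])) ^ (x[j] & y[i])
    return [reduce(lambda a, b: a ^ b,
                   (r[i][j] for j in range(ns) if j != i),
                   x[i] & y[i]) for i in range(ns)]

random.seed(3)
a, b, ns = 0x0DEA, 0x0BEE, 4  # n_s = 4 shares, i.e., third-order masking
assert unshare(sec_and(share(a, ns), share(b, ns))) == a & b
assert unshare(sec_xor(share(a, ns), share(b, ns))) == a ^ b
print("SecAND and SecXOR recombine correctly")
```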
Correctness. For Kyber, we use $q = 3329$ with the parameters $k = 12$, $\lceil q/4 \rfloor = 832$ and $\lfloor q/2 \rfloor = 1664$. Let us provide the detailed steps to derive the equation for computing the compression operation $\mathsf{Compress}^s_q(x)$ using only XOR, AND, and negation.

Figure 2: The gadgets considered in the proof of Theorem 1. $t$-NI gadgets are depicted with a single circle, $t$-SNI gadgets are depicted with a double circle.

Complexity.
We estimate the run-time complexity $T_{A1}(n_s, k)$ of Algorithm 1 as $n_s$ goes to infinity. For ease of notation we write $T_f(n_s, k)$ as $T_f$. Algorithm 1 consists of one A2B conversion and one Bitslice step per polynomial, followed by a constant number $p$ of SecAND, SecREF, negation and SecXOR calls on the bitsliced words, with $k = \lceil \log_2(q) \rceil = 12$ and $w$ denoting the word size of the target platform, i.e., how many bits can be processed in parallel. Using $T_{A2B} = O(n_s^2 \cdot \log_2(k))$, $T_{\mathsf{Bitslice}} = O(n_s)$, $T_{\mathsf{SecAND}} = O(n_s^2)$, $T_{\mathsf{SecREF}} = O(n_s^2)$, $T_{\neg} = O(1)$ (negating one of the input shares), and $T_{\mathsf{SecXOR}} = O(n_s)$ (computing the share-wise XOR) as the complexities for the modules, we derive the asymptotic run-time complexity for Algorithm 1 as $T_{A1} = O(n_s^2 \cdot \log_2(k))$ for a constant $p$. Analogously, we can derive the randomness complexity $R_{A1} = O(n_s^2 \cdot \log_2(k))$ by replacing the run-time complexities of the modules with the corresponding randomness complexities.

Security.
To argue about the higher-order security of Algorithm 1, we prove it to be $t$-SNI with $n_s = t + 1$ shares. This provides resistance against a probing adversary with $t$ probes and allows using the algorithm in larger compositions. The proof requires us to show how probes on intermediate (and output) variables in the algorithm can be perfectly simulated with only a limited number of the input shares. To this end, we iterate over all possible intermediate variables, starting from the output, and provide formal arguments on how they can be simulated relying on the $t$-(S)NI properties of the modules. In this step, it is important to ensure that the simulation of $t_x$ probes on one intermediate variable does not require more than $t_x$ shares of another intermediate variable. Otherwise, the simulation is not sound, as it would require more than $t$ shares of one intermediate variable for $t_x = t$. For $t$-SNI, it is important to further show that the simulation of the intermediate and output probes can be performed with only a subset of the input shares with cardinality equal to the number of intermediate probes.

Theorem 1 ($t$-SNI of Algorithm 1). Let $a^{(\cdot)}_A$ be the input and let $m^{(\cdot)}_B$ be the output of Algorithm 1. For any set of $t_{A1}$ intermediate variables and any subset $O$ of output indices such that $t_{A1} + |O| \le t$, there exists a subset $I$ of input indices with $|I| \le t_{A1}$, such that the $t_{A1}$ intermediate variables and the output shares $m^{(O)}_B$ can be perfectly simulated from $a^{(I)}_A$.
Proof. We model Algorithm 1 as a sequence of $t$-(S)NI gadgets as depicted in Figure 2. For simplicity, we model the linear operations in Lines 2, 4, 5 and 9 as $t$-NI gadgets, which can be trivially shown as these operations process the inputs share-wise. Furthermore, as the iterations of the initial loop are independent, we consider them to be executed in parallel and summarize them into single gadgets, one for Line 2 and one for Line 3. The exact mapping of gadgets in Figure 2 to Algorithm 1 is as follows:
• $G_1$ (NI): subtraction in Line 2.
An adversary can place probes internally and on the output shares of each gadget. The number of internal (resp. output) probes for gadget $G_i$ is denoted as $t_{G_i}$ (resp. $o_{G_i}$) with $\sum_i t_{G_i} \le t_{A1}$ and $o_{G_{13}} = |O|$, where $t_{A1}$ and $|O|$ refer to the number of probes and output shares of the complete Algorithm 1 as used in Theorem 1. To prove Theorem 1, we show that the internal probes and output shares can be perfectly simulated with $\le t_{A1}$ of the input shares $a^{(\cdot)}_A$. To this end, we argue about the internal probes and output shares of each gadget relying on their $t$-(S)NI property. In particular, we rely on the characteristic that the simulation of a $t$-SNI gadget can be performed independently of the number of probed output shares. This allows stopping the propagation of probes from the output shares to the input shares. For example, to simulate the $t_{G_{13}}$ intermediate and $o_{G_{13}}$ output probes of the $t$-NI gadget $G_{13}$, we require $t_{G_{13}} + o_{G_{13}}$ shares of both inputs of $G_{13}$ (i.e., $x^{(\cdot)}_{B,11}$ and the output of $G_{12}$). Throughout a larger composition, the shares required for simulation add up. To avoid an unsound simulation, it is often required to use $t$-SNI gadgets to stop the propagation of probes on the output shares; e.g., the $t_{G_{12}}$ intermediate and $o_{G_{12}}$ output probes of the $t$-SNI gadget $G_{12}$ can be simulated with only $t_{G_{12}}$ input shares (i.e., without $o_{G_{12}}$). As shown later, we need to insert $t$-SNI SecREF gadgets to ensure that our simulation is sound.
In the following, we provide details for the simulation at particular points in the algorithm. The complete explanation for each gadget is provided in Appendix A. To simulate the internal probes and output shares of gadgets $G_4$ to $G_{13}$, we need a certain number of shares of each bit of $x^{(\cdot)}_B$. To argue about Bitslice, we summarize the variables of each bit to $x^{(\cdot)}_B$ and, for the simulation, add up the number of shares required for each bit as $t_{x^{(\cdot)}_B} = \sum_i t_{x^{(\cdot)}_{B,i}}$. This simulation can only be performed if there are no duplicate entries in the sum: without the $t$-SNI refresh $G_5$, the simulation of $G_6$ would require $t_{G_6}$ shares of both of its inputs, which depend on the same bits of $x^{(\cdot)}_B$. In effect, $t_{x^{(\cdot)}_B}$ would be $\ge 2 \cdot t_{G_6}$, which cannot be simulated for, e.g., $t_{G_6} = t$. Therefore, it is necessary to refresh the input to $G_6$, and analogously to $G_9$. For the other SecAND gadgets, this issue does not occur and, therefore, we do not need to refresh their inputs.
Given the $t$-NI property of Bitslice, we can simulate the $t_{x^{(\cdot)}_B}$ shares of $x^{(\cdot)}_B$ with the corresponding number of shares of $a^{(\cdot)}_B$. Following the flow through gadgets $G_2$ and $G_1$, the simulation of Algorithm 1 requires $|I| = t_{G_1} + o_{G_1} + t_{G_2}$ of the input shares $a^{(\cdot)}_A$. In particular, the $t$-SNI property of $G_2$ allows simulating the shares of $a^{(\cdot)}_B$ with only $t_{G_2}$ of its input shares, i.e., independently of the number of probes on $a^{(\cdot)}_B$, which stops the propagation of $t_{x^{(\cdot)}_B}$ to $|I|$. As $|I| \le t_{A1}$ and independent of $o_{G_{13}}$, Algorithm 1 is $t$-SNI.
Extension to $\mathsf{Compress}_q(x, d)$ for $d > 1$. We use Algorithm 1 for compression to $d = 1$ bits, but it can be adapted to create masked compression functions for $d > 1$ bits as well. To this end, it is necessary to derive the Boolean equations for each of the $d$ output bits, analogous to Equation (1). These are then computed using instantiations of SecXOR and SecAND with independently shared inputs. A generic description of Algorithm 1 for any $d$ would, therefore, need to refresh the input to every SecAND, which would induce a significant overhead. In this section, an optimized version for $d = 1$, as used in Kyber, is provided. The creation of optimized versions for other $d$ is straightforward when using formal verification tools, e.g., scVerif [BGG+21] or MaskVerif [BBD+15, BBC+19], to check which of the refreshes are needed. In the following section, we develop a dedicated technique that avoids masking the ciphertext compression of Kyber.CPAPKE.Enc altogether (i.e., avoids extending Algorithm 1 to $d > 1$), as the latter would require processing all input bits.
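The contrast between $d = 1$ and $d > 1$ can be made concrete by checking which input bits can ever influence the output. A small exhaustive experiment (our own illustration; helper names are hypothetical) shows that the shifted one-bit compression depends only on bits 11 to 7, while for $d = 4$ even the lowest input bit matters:

```python
from math import floor

Q = 3329

def rnd(z):
    return floor(z + 0.5)

def compress(x, d):
    return rnd((2 ** d / Q) * x) % (2 ** d)

def compress_s(x):
    return 1 if x >= Q // 2 else 0  # shifted one-bit compression

def bit_matters(f, bit):
    """Can flipping input bit `bit` ever change f(x) for x in [0, Q)?"""
    return any((x ^ (1 << bit)) < Q and f(x) != f(x ^ (1 << bit))
               for x in range(Q))

# d = 1 (shifted): only bits 11..7 can influence the output, so the
# masked binary search may ignore the low bits.
print([b for b in range(12) if bit_matters(compress_s, b)])  # [7, 8, 9, 10, 11]
# d = 4 (Kyber's d_v): even bit 0 can flip an output value, so an extension
# of the masked compression to d > 1 must process all input bits.
print(bit_matters(lambda x: compress(x, 4), 0))  # True
```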

Higher-Order Masked Comparison
The masked ciphertext comparison requires computing $c \overset{?}{=} c'$ in a masked fashion, which assumes a prior ciphertext compression in Kyber.CPAPKE.Enc. More explicitly, the comparison verifies whether

$$\mathsf{Compress}_q(\mathbf{u}', d_u) \overset{?}{=} c_1 \ \wedge\ \mathsf{Compress}_q(v', d_v) \overset{?}{=} c_2, \quad (2)$$

where $(\mathbf{u}', v')$ is the uncompressed re-encrypted ciphertext and $c = (c_1, c_2)$ is the received public ciphertext. For the ciphertext compression, to the best of our knowledge, there is no efficient higher-order solution beyond generic approaches, e.g., masked look-up tables, when using prime moduli. In [OSPG18] a hash-based first-order comparison approach is proposed. However, this only checks for equality and is independent of the ciphertext compression. To apply this technique to Kyber, it would be necessary to perform a masked ciphertext compression as a prior step. In [BPO+20], a higher-order polynomial comparison is proposed which also checks for equality but suffers from similar drawbacks as the techniques from [OSPG18]. Note that this approach was also shown to be flawed in [BDH+21] and the proposed fix significantly reduces the performance.

Given that none of the prior-art solutions work without a masked ciphertext compression, we propose a new masked algorithm to perform the comparison between a masked uncompressed ciphertext (i.e., the output of our masked re-encryption) and a public compressed ciphertext. The core idea is to not perform the costly masked compression of the sensitive values but to work the other way around: decompressing the public ciphertext. Since this is public information, this can be done efficiently. Informally, this changes Eq. (2) to

$$S(c_1) \le \mathbf{u}' < E(c_1) \ \wedge\ S(c_2) \le v' < E(c_2), \quad (3)$$

where the checks are applied coefficient-wise and $S$ and $E$ map a compressed coefficient to the start- and end-point of its decompression interval. Since the compression is lossy, one cannot simply check for equality. Instead, one has to perform a masked range check for each coefficient to verify that the uncompressed sensitive values fall into the decompressed interval. In particular, one first derives the interval start- and end-points for a compressed coefficient using the public functions $S$ and $E$. These border values are then subtracted from the uncompressed masked coefficients separately, which is efficient given that they are arithmetically masked.
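The functions S and E are defined by Kyber; as a sketch (a hypothetical construction of our own), one can derive them by enumerating the preimage of each compressed value and then check that interval membership is exactly equivalent to compressing to that value:

```python
from math import floor

Q = 3329

def rnd(z):
    return floor(z + 0.5)

def compress(x, d):
    return rnd((2 ** d / Q) * x) % (2 ** d)

def interval(b, d):
    """Hypothetical S/E: start and (exclusive) end of the preimage of b under
    Compress_q(., d), a contiguous interval modulo q (it wraps for b = 0)."""
    members = sorted(x for x in range(Q) if compress(x, d) == b)
    for i in range(1, len(members)):
        if members[i] != members[i - 1] + 1:      # gap => interval wraps mod q
            return members[i], members[i - 1] + 1
    return members[0], members[-1] + 1

d = 4  # Kyber's d_v
for b in range(2 ** d):
    S, E = interval(b, d)
    length = (E - S) % Q
    # the range check used instead of masked compression:
    # Compress_q(a, d) == b  <=>  (a - S(b)) mod q < |interval|
    assert all((compress(a, d) == b) == (((a - S) % Q) < length) for a in range(Q))
print("interval membership is equivalent to compressing to b")
```

Note how the modular subtraction handles the wrapping interval of $b = 0$ without a special case, which is exactly why the border values can be subtracted directly from arithmetically masked coefficients.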
Then each of these values is transformed to Boolean masking to extract the MSB, which contains the result of the coefficient interval check: the MSB can be viewed as something similar to the "sign" bit; see the correctness paragraph below for a more detailed explanation.

Algorithm 2 Masked DecompressedComparison as used in Kyber.
Input:
1. An arithmetic sharing u^(·)_A of a vector of polynomials u ∈ Z_q[X]^k,
2. An arithmetic sharing v^(·)_A of a polynomial v ∈ Z_q[X],
3. A ciphertext c ∈ B^(d_u·k·n/8 + d_v·n/8),
4. Two public functions S and E defined by Kyber which specify the start- and end-points of the intervals in compression.
If the compressed coefficient is indeed inside the desired interval, the MSB of both range checks should be one. For the comparison, we need to combine the interval checks of all coefficients into one masked output bit. This is achieved using bitsliced calls to SecAND until one bit remains, which is set to one if and only if Eq. (3) is fulfilled. The complete masked algorithm for the DecompressedComparison is given in Algorithm 2. Again, we provide a short description of the new modules and their assumed security property. MSB extracts the Boolean-masked most significant bit of the given input Boolean shares, which is assumed to be t-NI as it can be applied on each share separately. LSR refers to the sharewise logical shift to the right of input Boolean shares by a given offset, which is also assumed to be t-NI.
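To sketch the combining step, here is a hypothetical first-order (two-share) version in C: sec_and is the classic ISW-style AND on Boolean shares, and fold_to_bit repeatedly SecANDs bitsliced words (and then the bits within the final word, via sharewise shifts as with LSR) until a single masked bit remains. The function names and the placeholder randomness source are ours, for illustration only, not the paper's implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* First-order ISW AND on 32-bit Boolean shares:
 * (c0 ^ c1) = (a0 ^ a1) & (b0 ^ b1). */
static void sec_and(uint32_t a0, uint32_t a1, uint32_t b0, uint32_t b1,
                    uint32_t *c0, uint32_t *c1) {
    uint32_t r = (uint32_t)rand();   /* placeholder RNG, not for production */
    *c0 = (a0 & b0) ^ r;
    *c1 = (a1 & b1) ^ (r ^ (a0 & b1) ^ (a1 & b0));
}

/* Fold n (a power of two) bitsliced share words down to one word per
 * share, then fold the 32 bits of that word to a single masked bit.
 * Modifies x0/x1 in place. */
static void fold_to_bit(uint32_t *x0, uint32_t *x1, size_t n,
                        uint32_t *bit0, uint32_t *bit1) {
    for (size_t half = n / 2; half >= 1; half /= 2)
        for (size_t i = 0; i < half; i++)
            sec_and(x0[i], x1[i], x0[i + half], x1[i + half], &x0[i], &x1[i]);
    uint32_t w0 = x0[0], w1 = x1[0];
    for (int s = 16; s >= 1; s /= 2)      /* sharewise shift = LSR module */
        sec_and(w0, w1, w0 >> s, w1 >> s, &w0, &w1);
    *bit0 = w0 & 1;
    *bit1 = w1 & 1;
}
```

The result bit unmasks to one exactly when every bit of every input word is set, i.e., when all per-coefficient interval checks passed.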
Correctness. To better understand Algorithm 2, let us first go through the unmasked decompressed comparison using one coefficient as an example. We move the costly compression step to the public variable, i.e., we check if the public b would be decompressed to an interval which contains a. As the compression is lossy, there are multiple values of a which can be mapped to b through Compress_q. Therefore, a straightforward check for equality would only work for one specific value of a. Instead, we check whether a lies in the interval [S(b), E(b) − 1]. While this is trivial for unmasked values, a is sensitive and, therefore, this operation needs to be masked. Performing a generic less-than comparison is straightforward for power-of-two moduli, but challenging for prime moduli such as the one used in Kyber. The idea to achieve this is to compute a − S(b) mod q and a − E(b) mod q and to check the "sign" bits. This is done in a masked fashion by first performing an A2B conversion and subsequently extracting the MSB of the shares. If a is indeed in the interval [S(b), E(b) − 1], then one expects the MSB of a − S(b) mod q to be 0 while the MSB of a − E(b) mod q should be 1. These can be combined with a SecAND by first negating the masked bit of the first range check.
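To make this interval check concrete, the following C sketch performs the two "sign"-bit tests without any masking (the masked version applies A2B and MSB on shares instead). The brute-force derivation of S and E and all helper names are ours, for illustration only; the test exhaustively cross-checks the rule against Kyber's Compress_q for d = 10.

```c
#include <assert.h>
#include <stdint.h>

#define Q 3329
#define D 10            /* compression to d = 10 bits, as for u in Kyber768 */

/* Kyber coefficient compression: round((2^d / q) * x) mod 2^d. */
static uint32_t compress(uint32_t x) {
    return (((x << D) + Q / 2) / Q) & ((1u << D) - 1);
}

/* Illustrative S/E: start and (one past) end of the cyclic interval of
 * coefficients mapping to b, recovered by scanning around the
 * decompressed midpoint round(q * b / 2^d). */
static uint32_t decompress(uint32_t b) {
    return ((b * Q + (1u << (D - 1))) >> D) % Q;
}
static uint32_t S(uint32_t b) {
    uint32_t c = decompress(b);
    while (compress((c + Q - 1) % Q) == b) c = (c + Q - 1) % Q;
    return c;
}
static uint32_t E(uint32_t b) {
    uint32_t c = decompress(b);
    while (compress((c + 1) % Q) == b) c = (c + 1) % Q;
    return (c + 1) % Q;
}

/* Unmasked range check: the MSB (bit 11) of a - S(b) mod q must be 0
 * and the MSB of a - E(b) mod q must be 1. */
static int in_interval(uint32_t a, uint32_t b) {
    uint32_t t1 = (a + Q - S(b)) % Q;
    uint32_t t2 = (a + Q - E(b)) % Q;
    return (((t1 >> 11) & 1) == 0) && (((t2 >> 11) & 1) == 1);
}
```

The two-MSB rule accepts exactly the coefficients compressing to b; this relies on the interval-size bound discussed next.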
In order to avoid this negation, one can shift the values in the first check appropriately (by adding 2^(⌈log2(q)⌉−1)) such that it becomes a − S(b) + 2^11 in the setting of Kyber. Now both resulting MSBs need to be put into a SecAND to produce the masked output of the comparison for this coefficient. It should be noted that this approach requires that the size of the largest interval [S(b), E(b) − 1] is smaller than or equal to the difference between the used modulus and the next smaller power of two, i.e., q − 2^(⌈log2(q)⌉−1). For Kyber this is indeed the case for all parameter sets; for the values d ∈ {4, 5, 10, 11} defined in Kyber and used in Compress_q(x, d) we have an interval size of at most 209, which is well below q − 2^(⌈log2(q)⌉−1) = 3329 − 2^11 = 1281. Note that the special case where d = 1 is handled in detail in Section 3.1.

Complexity. Again, we estimate the run-time complexity T_A2(n_s, k) of Algorithm 2 as n_s goes to infinity.
Here k = ⌈log2(q)⌉ = 12 and w denotes the word size of the target platform. Given, in addition to the module complexities of the previous section, that T_MSB = T_LSR = T_mod = O(n_s) (applying the operation sharewise), we derive the asymptotic run-time complexity for Algorithm 2 as T_A2 = O(n_s^2 · log2(k)) for a constant p. Again it is dominated by the A2B conversion and, as in the previous case, the randomness complexity is R_A2 = O(n_s^2 · log2(k)).

Security. As we also did for Algorithm 1, we now prove Algorithm 2 to be t-SNI with n_s = t + 1 shares.

Proof. Again, we model Algorithm 2 as a sequence of t-(S)NI gadgets as depicted in Figure 3 and, as was the case in Theorem 1, we model the linear operations in lines 7, 8, 15, 16, 17, and 20 as t-NI gadgets. The input c and its derived variables (u′, v′) and (s_u, e_u, s_v, e_v) are not explicitly considered in the proof as they are public values and their simulation is therefore trivial. As before, given that the iterations of the initial loops (i.e., PolyCompare) are independent, we consider them to be executed in parallel and summarize them into single gadgets. In this regard, we model the sequence MSB ∘ A2B as a single t-SNI gadget, which holds if the conversion A2B is t-SNI and MSB is applied sharewise. We unroll the final loop into two iterations, but the presented simulation concept generalizes to any number of rounds due to the t-SNI property of the used SecAND. The exact mapping of gadgets in Figure 3 to Algorithm 2 is as follows:
• G_{1−4} (NI): Assignment in Line 15.
An adversary can place probes internally and on the output shares of each gadget. The number of internal (resp. output) probes for gadget G_i is denoted as t_{G_i} (resp. o_{G_i}), with Σ_i t_{G_i} + Σ_i o_{G_i} ≤ t_A2 + |O|, where t_A2 and |O| refer to respectively the number of probes and output shares of the complete Algorithm 2 as used in Theorem 2.
To prove Theorem 2, we show that the internal probes and output shares can be perfectly simulated with at most t_A2 of the input shares u^(·)_A and v^(·)_A. Again, we provide details for the simulation at particular points in the algorithm. The complete explanation for each gadget is provided in Appendix B.
Starting with G_25, its t_{G_25} internal probes and o_{G_25} output shares can be simulated with t_{G_25} of the output shares of G_23 and G_24. This leads to a problem, however, as the simulation of these output shares requires a corresponding number of shares of the output of G_22, i.e., 2 · t_{G_25}. Therefore, on a variable level these probes cannot be perfectly simulated. To overcome this issue, we model the gadgets G_17 to G_25 to work on bit level rather than on the complete variables. This requires that we build a multi-bit SecAND from parallel and independent one-bit instantiations of SecAND, each of which is t-SNI. The t-NI gadgets which are used to extract the upper and lower halves (G_{20,21,23,24}) can be represented similarly by one-bit t-NI gadgets, i.e., only selected bits are passed through, while the others are discarded.
We now explain the simulation for probes on the Least Significant Bit (LSB), but the presented approach applies to probes on arbitrary bits. To simulate t_{G_25} internal probes and o_{G_25} output shares of the LSB of G_25, we need t_{G_25} output shares of the LSB of G_23 and G_24. The former can be simulated with (t_{G_25} + t_{G_23} + o_{G_23}) output shares of the LSB of the upper half of G_22, while the latter requires (t_{G_25} + t_{G_24} + o_{G_24}) output shares of the LSB of the lower half of G_22. As these halves are independent, the simulation succeeds.
The same approach can be applied to simulate gadgets G_{20−22}. They require (t_{G_22} + t_{G_20} + o_{G_20}) output shares of the LSB of the upper half and (t_{G_22} + t_{G_21} + o_{G_21}) output shares of the LSB of the lower half of G_19. For G_17 (resp. G_18), we need t_{G_17} (resp. t_{G_18}) output shares of the LSB of its input. To extend the simulation to probes on arbitrary bits, it is sufficient to replace the LSB with the corresponding indices of the probed bits.
As in the proof of Theorem 1, we add up the shares required for simulation for each of the bits of the bitsliced variables to argue about Bitslice. This is valid as there are no duplicate entries in the sums even without refreshes, due to the t-SNI property of G_{17,18}, e.g., t_{t^(·)_B} = Σ_{i=0}^{255} t_{G_17−Bit_i}. Again relying on the t-NI property of Bitslice, we can simulate the shares of the bitsliced variables with the corresponding number of shares of (w^(·)_B, x^(·)_B, y^(·)_B, z^(·)_B). Following the flow through gadgets G_{1−12}, the simulation of Algorithm 2 requires |I| of the input shares u^(·)_A and v^(·)_A. Any probe on computations involving the outputs of gadgets G_9 to G_12 does not propagate to the respective inputs due to the t-SNI property of these gadgets. As |I| ≤ t_A2 and independent of o_{G_25}, Algorithm 2 is t-SNI.

Masked CCA Kyber Decapsulation
We now return to the Kyber decapsulation given in Figure 1 and reason about the SCA security of the complete decapsulation. Note that we omitted the encode- and decode-operations as well as the generation of the public matrix A in the figure, as they are either trivial to mask or process only public values.
The linear polynomial operations (i.e., •, NTT, −, +) in Kyber.CPAPKE.Dec and Kyber.CPAPKE.Enc are masked as in previous works by applying the operation on each share separately. For Compress q (x, 1) of Kyber.CPAPKE.Dec, we rely on our new approach as presented in Section 3.1.
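As a minimal illustration of this sharewise principle, the following sketch (our naming; shares satisfy a = a_0 + a_1 mod q, and n is shortened from Kyber's 256) adds two first-order masked polynomials by operating on each share independently:

```c
#include <assert.h>
#include <stdint.h>

#define Q 3329
#define N 8   /* shortened for illustration; Kyber uses n = 256 */

/* Sharewise masked polynomial addition: each arithmetic share of the
 * result is the coefficient-wise sum of the corresponding input shares.
 * No share is ever combined with the other, so no secret is exposed. */
static void masked_poly_add(const uint16_t a[2][N], const uint16_t b[2][N],
                            uint16_t r[2][N]) {
    for (int s = 0; s < 2; s++)           /* first order: two shares */
        for (int i = 0; i < N; i++)
            r[s][i] = (uint16_t)((a[s][i] + b[s][i]) % Q);
}
```

NTT, subtraction, and multiplication by public values follow the same pattern, which is why these operations incur only a factor-n_s overhead.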
To mask the symmetric components G and PRF, we rely on prior art. In particular, we use the masked Keccak approach and implementation from [BBD + 16] to instantiate the modules at higher orders, while we use the more efficient approach from [BDPVA10] for the first order. We believe there is room for performance improvement by creating dedicated and more efficient masking schemes for Keccak aiming at a specific masking order (as done for the first-order setting), but this is out of scope for the current work.
For Decompress_q, we first convert each bit of the Boolean-shared message m^(·)_B to arithmetic shares modulo q (e.g., using the efficient one-bit B2A algorithm from [SPOG19] at higher orders), which are then multiplied with a constant.
For the masked binomial sampling, we follow [SPOG19]: the Boolean-shared bit sums a and b are subtracted to obtain f = a − b. Then we convert f to the arithmetic domain with shift constant 4, i.e., we apply the conversion to f ⊕ 4. Note that all these operations are applied in a bitsliced manner: on a 32-bit platform these operations can be performed on 8 coefficients at the same time, assuming a and b are represented with 2 bits each. The subtraction of 4 after the conversion is trivial in the arithmetic domain. As depicted in Figure 1, our approach does not explicitly mask the ciphertext compression of Kyber.CPAPKE.Enc. Instead, we instantiate the comparison as presented in Section 3.2, which can process masked uncompressed polynomials. Furthermore, in line with the findings of [BDH + 21], we collapse the result of the comparison to a single masked bit before unmasking it for the selection of the KDF input.

Figure 4: The gadgets considered in the proof of Theorem 3. t-NI gadgets are depicted with a single circle, t-SNI gadgets are depicted with a double circle.
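One plausible layout realizing such eight-at-a-time arithmetic is to give each coefficient a 4-bit lane of a 32-bit word: since a + 4 ∈ [4, 6] and b ∈ [0, 2] per lane, computing f + 4 = a − b + 4 lane-wise never borrows across lanes. This packing is our illustration of the stated throughput, not the paper's exact data layout:

```c
#include <assert.h>
#include <stdint.h>

/* Pack eight 2-bit values into the low 2 bits of eight 4-bit lanes. */
static uint32_t pack_nibbles(const uint8_t v[8]) {
    uint32_t w = 0;
    for (int i = 0; i < 8; i++)
        w |= (uint32_t)(v[i] & 0x3u) << (4 * i);
    return w;
}

/* Compute f + 4 = a - b + 4 for eight coefficients at once.
 * Adding 0x44444444 puts 4 into every lane; lanes never borrow
 * because (a + 4) >= 4 > 2 >= b in each lane. */
static uint32_t cbd_lanes(uint32_t packed_a, uint32_t packed_b) {
    return packed_a + 0x44444444u - packed_b;
}

/* Extract lane i (4 bits) from a packed word. */
static uint8_t lane(uint32_t w, int i) {
    return (uint8_t)((w >> (4 * i)) & 0xFu);
}
```

Each lane then holds f + 4 ∈ [2, 6], matching the shifted representation that is handed to the conversion.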
We follow the approach and reasoning of [BDK + 20] and do not mask the KDF. Instead, if the comparison outputs true (i.e., the ciphertext is valid), we unmask K and perform an unmasked KDF. For a valid ciphertext this leaks only ephemeral secret information and not the long-term secret. Should this short-term secret also require protection, other countermeasures besides masking can be applied to mitigate single-trace attacks. Note that it is important not to unmask K if the comparison fails, because this could be used to attack the long-term secret. If the comparison does fail, we apply an unmasked KDF to the secret value z. This value is independent of the secret key, but leaking it allows an adversary to detect ciphertext rejection explicitly. This does not impact the IND-CCA security claims of Kyber [HHK17, Figure 1] as Kyber is γ-spread for sufficiently large γ.
To argue about the probing security of masked Kyber.CCAKEM.Dec, we analyze a reduced composition (denoted as G_Dec) excluding the unmasked components. The structure of G_Dec is depicted in Figure 4.

Theorem 3. For any set of t_{G_Dec} ≤ t intermediate variables and every subset O of output share indices, there exists a subset I of input indices such that the t_{G_Dec} intermediate variables as well as b^(O)_B and K^(O)_B can be perfectly simulated from ŝ^(I)_A.
Proof. We model the linear operations of the decryption and encryption as t-NI gadgets G_1 and G_5. The new t-SNI Compress_q(x, 1) and comparison algorithms are included as G_2 and G_7. The symmetric components are modeled as a t-NI gadget G_6. As shown in [SPOG19], the sampling algorithm is t-SNI and their proof is independent of the concrete instantiation parameters. Therefore, we model it as the t-SNI gadget G_4. For Decompress_q, we assume a t-SNI gadget G_3 which relates to the t-SNI B2A conversion. The subsequent linear multiplication is included in G_5.
An adversary can place probes internally and on the output shares of each gadget. The number of internal (resp. output) probes for gadget G_i is denoted as t_{G_i} (resp. o_{G_i}), with Σ_i t_{G_i} + Σ_i o_{G_i} ≤ t_{G_Dec} + |O|, where t_{G_Dec} and |O| refer to respectively the number of probes and output shares of the complete gadget G_Dec as used in Theorem 3. To prove Theorem 3, we show that the internal probes and output shares can be perfectly simulated with at most t_{G_Dec} of the input shares ŝ^(·)_A. Again, we provide details for the simulation at particular points in the algorithm. The complete explanation for each gadget is provided in Appendix C.
To simulate the internal probes and output shares of gadgets G_3 to G_7, we need t_{G_3} + t_{G_6} + o_{G_6} + t_{G_4} shares of m^(·)_B. Following the flow through gadgets G_{1,2}, the simulation of G_Dec requires |I| = t_{G_1} + o_{G_1} + t_{G_2} of the input shares ŝ^(I)_A. As |I| ≤ t_{G_Dec} and independent of o_{G_6} and o_{G_7}, gadget G_Dec is t-SNI.

Implementation and Evaluation
We present performance and practical security results of the new masked algorithms presented in Section 3. We target Kyber768 since this is the recommended parameter set targeting NIST security level 3. We select two platforms: firstly, benchmarks (using the SysTick timer) and measurements are performed on an NXP Freedom Development Board for Kinetis Ultra-Low-Power KL82 MCUs (FRDM-KL82Z [NXP16]). The Cortex-M0+ was chosen because it is the most energy-efficient Arm processor available for constrained embedded applications. The processor comes equipped with a 2-stage pipeline, the Armv6-M architecture and the Thumb/Thumb-2 subset of instruction support. This makes the Cortex-M0+ a perfect candidate for hardening cryptographic primitives, since hardened assembly code for the Cortex-M0+ can run on more advanced Arm instruction sets, while vice versa this is not necessarily true. Therefore, this hardened Cortex-M0+ implementation can serve as a helpful starting point to create secure hardened implementations for other Cortex target processors. Moreover, the side-channel behavior of the Cortex-M0+ is well understood, allowing one to mitigate device-specific leakage behavior with fine-grained hardening strategies [BGG + 21], whereas the unclear side-channel characteristics of the Cortex-M4 force inserting dummy operations in many places, purely based on assumptions, resulting in additional performance penalties which are independent of the masked algorithms. This difference in understanding has been used to save up to 72% of dummy operations and can lead to optimized implementations which are twice as fast [BGG + 21].
Secondly, although we do not perform measurements or any hardening, we run performance benchmarks on the STM32F407G-DISC1 board that comes equipped with a Cortex-M4F (previously known as STM32F4DISCOVERY). This is the platform used by the embedded crypto benchmarking framework pqm4 [KRSS19] and recent masked implementations of Saber [BDK + 20], which allows us to compare with existing work more easily. We make use of the standard measurement framework of pqm4, with minor modifications to measure the run-time of subroutines.
In this section, a component-wise performance comparison for various orders and implementation choices is given. Our masked Kyber implementation is generally written in C and based on the C reference code from the Round 3 Kyber submission. For the Cortex-M4 processor we included the optimized assembly routines from pqm4, but this assembly is incompatible with the much simpler Cortex-M0+. On the other hand, for our first-order Cortex-M0+ implementation we provide low-level formal verification and physical leakage assessments based on power measurements. For this purpose we target our own components Compress_q (., 1) and DecompressedComparison. These hardened components (and any components that they rely on) are therefore written in assembly. Although hardening involves adding dummy operations that decrease efficiency, our hand-written hardened assembly still performs better than the compiler-generated versions from the (unhardened, masked) plain-C implementation. Following the same approach as [BDK + 20], we use an already existing masked implementation of Keccak in our masked Kyber implementation. More specifically, we re-used the first-order masked implementation from [BDPVA10] and for higher orders use the more generic higher-order secure implementation of Keccak from maskComp [BBD + 16].

Randomness Generation. During the execution of decapsulation, fresh randomness is needed for the masked operations. For example, the first-order masked implementation on the FRDM-KL82Z uses 11 665 uniformly randomly sampled bytes for the decapsulation operation (see Table 2). As we would like the power measurements to be reproducible, the numbers for the FRDM-KL82Z reported in Table 2 assume that the random bytes can be readily read off from a table, which is filled before execution of the Kyber functions. Therefore, the cost of randomness generation is not included in our performance numbers for this platform.
On the other hand, the STM32F407G-DISC1 board comes equipped with a TRNG. For fair comparison to existing work we do include the randomness generation in the cycle counts on this platform.

Performance Comparison
The main goal of this section is to demonstrate the feasibility of the new techniques to realize a (higher-order) masked Kyber implementation. For the FRDM-KL82Z we present the results of plain-C implementations and do not optimize on assembly level except for hardening some components. That being said, our first-order masked implementation does aim to be efficient from an algorithmic point of view to fairly represent the performance impact. For the STM32F407G we include the optimized assembly routines from pqm4. All implementations were compiled with arm-none-eabi-gcc version 8.3.1 with optimization level O3. The higher-order implementations (i.e., the second and third order results in Table 2) are not as aggressively optimized and therefore have more room left for improvement, in particular because the existing higher-order masked Keccak implementations are not heavily optimized.
First-Order Masking. Recall that we use Algorithm 1 for Compress_q (., 1), Algorithm 2 for DecompressedComparison and [SPOG19, Algorithm 3] for the conversion from arithmetic to Boolean shares (A2B). However, for first-order masking the algorithm from [SPOG19] is not the most efficient. Instead, we use a Look-Up-Table (LUT) based approach; more specifically, the improved [Deb12] version of the Coron-Tchulkine method [CT03]. This algorithm was designed for power-of-two moduli and so cannot be used directly for our prime q. To overcome this, we simply use larger tables to avoid dealing with any carry propagation. Moreover, we refresh the output with fresh randomness, as the input and output masks in our case are from different domains, and to achieve the assumed t-SNI property.
Let us give a concrete example of the approach for the implementation of the first-order A2B conversion. The table L satisfies L(a) = ((a + r_a) mod q) ⊕ r_b, where a is a secret value in [0, q − 1] which is arithmetically masked with randomness r_a ∈ [0, q − 1] on the input side, and a Boolean mask with the random value r_b ∈ [0, 2^⌈log2(q)⌉ − 1] is applied on the output side. Then the arithmetic to Boolean conversion is implemented as in Figure 5. Here rng(x, y) stores y uniformly random sampled bits in x, and csubq performs a conditional subtraction by the modulus q; more explicitly, the constant-time equivalent of the C-expression "c = ((c >= q)? c-q : c)".

Table 1: The different LUT settings used in our first-order (FO) and higher-order (HO) implementations on the FRDM-KL82Z and their corresponding cycle counts, rounded up to the nearest 10^3 cycles. For first order a LUT is used for A2B; for higher orders it is not.
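The following unhardened C sketch illustrates the table construction and one way the conversion can be sequenced. The share convention x = x_1 + x_2 mod q and the exact operation order are our assumptions; the actual Figure 5 code is hardened assembly with explicit refreshing.

```c
#include <assert.h>
#include <stdint.h>

#define Q 3329

/* Constant-time conditional subtraction of q, as in the csubq helper:
 * the constant-time equivalent of c = ((c >= q) ? c - q : c),
 * valid for c in [0, 2q - 2]. */
static uint32_t csubq(uint32_t c) {
    c -= Q;
    c += (uint32_t)((int32_t)c >> 31) & Q;
    return c;
}

/* Build the masked table L(a) = ((a + ra) mod q) ^ rb. */
static void build_lut(uint32_t L[Q], uint32_t ra, uint32_t rb) {
    for (uint32_t a = 0; a < Q; a++)
        L[a] = csubq(a + ra) ^ rb;
}

/* First-order A2B: arithmetic shares x1 + x2 = x (mod q) in, Boolean
 * shares y1 ^ y2 = x out. No intermediate ever equals x unmasked:
 * t is still masked by x2, and idx = x - ra is masked by ra. */
static void a2b_lut(const uint32_t L[Q], uint32_t ra, uint32_t rb,
                    uint32_t x1, uint32_t x2, uint32_t *y1, uint32_t *y2) {
    uint32_t t   = csubq(x1 + Q - ra);  /* x1 - ra mod q */
    uint32_t idx = csubq(t + x2);       /* x - ra mod q  */
    *y1 = L[idx];                       /* x ^ rb        */
    *y2 = rb;
}
```

Because q is prime rather than a power of two, the full-width table sidesteps the carry handling of the original Coron-Tchulkine construction, at the cost of q table entries.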

The A2B of [Deb12] outperforms the method of [SPOG19] at first order, and therefore we use it in our first-order implementation. It is however not directly clear whether a similar approach can be used for Compress_q (., 1) and DecompressedComparison. Indeed, a completely analogous masked LUT can be created for these functions. For example, one can replace DecompressedComparison by implementing Compress_q (., d) with a LUT, followed by a hashed comparison as done in previous work [OSPG18,BDK + 20]. We compare the performance of the various options in Table 1. Concretely, we use the following four settings in our masked implementations. The first setting (denoted setting 0) uses no LUTs at all; this is the default for the higher-order (> 1) masked implementations. The first-order implementations do use the LUT-approach for A2B, and the respective possible settings for the FRDM-KL82Z for our modules can be found in Table 1. The LUTs are generated fresh for every Kyber invocation and this run-time is included in the overall reported performance results (Init). Since Init, Compress_q (., 1) and DecompressedComparison are only called once per decapsulation, the total cost is computed as the sum of the separate functions. It is clear that using the new algorithms introduced in this work is favorable compared to using the LUT-approach even in the first-order setting. This is due to the fact that the LUT-initialization takes a non-negligible amount of time, which is significant because the functions are only used once (as opposed to A2B). Therefore, our first-order implementations use setting 3.

Performance Discussion. We present a complete overview of our performance results on both the FRDM-KL82Z and the STM32F407G in Table 2. As a baseline unmasked implementation for the FRDM-KL82Z we take the Kyber768 implementation from the PQClean [KRS + 19] software library. This is purely written in C and therefore a comparison to our plain-C implementation is fair.
Of course some of our modules contain assembly modifications to harden them against power analysis, but this leads to only minor differences in cycle counts. For the STM32F407G we take the currently best optimized implementation from pqm4.
We see that the overall slowdown factor for crypto_kem_dec masked at first order on the FRDM-KL82Z is 2.2x. A large part of this can be attributed to the masked encryption step, which uses the masked PRF as part of the binomial sampler while also roughly doubling the cost of the polynomial arithmetic. The Compress_q (., 1) and DecompressedComparison (denoted comparison in the table) also introduce large slowdowns, but this was to be expected as their cost in the unmasked version was almost negligible. It should be noted that the overall slowdown factor is relatively small precisely because of the lack of assembly optimizations: the polynomial arithmetic still makes up a significant part of the run-time but itself has a small slowdown factor, which brings down the overall slowdown compared to the reference implementation.
On the other hand, the slowdown factor for first-order masked decapsulation on the STM32F407G is 3.5x. Although comparing subroutines directly is difficult due to interleaving in the pqm4 reference, the overall slowdown is larger than on the FRDM-KL82Z as the cost of polynomial arithmetic is relatively lower due to assembly optimization. More precisely, the cost of masking is dominated by the Keccak operations rather than the polynomial arithmetic, and the former are more expensive to mask. Our slowdown factor is better than the (tentative) factor 4.2x reported recently by Heinz et al. for a first-order masked implementation [Dan21] at the PQC Standardization Conference organized by NIST, though their version is hardened for the Cortex-M4 while ours is unhardened. The overall slowdown is larger than the 2.52x factor reported in [BDK + 20, Table 5] for masking Saber on the same platform. This is mainly caused by the high cost of the Keccak-based binomial sampler: constructing the four error polynomials for Kyber requires the use of the binomial sampler, which uses rejection sampling modulo 3329 to convert the shares from Boolean to arithmetic (256 per polynomial). A similar operation is also performed in Saber and Kyber for the generation of the 3 secret polynomials. However, Saber uses rounding as opposed to random errors and therefore does not need to generate these error vectors.
Unsurprisingly, the number of random bytes used by Kyber is larger than for Saber. Whereas [BDK + 20] makes 1262 calls to a 32-bit TRNG, using 5048 bytes in total, we sample a total of 12 072 random bytes. This is firstly caused by the generation of additional error polynomials, as mentioned above. Secondly, the Compress q (., 1) and DecompressedComparison components require more randomness compared to their counterparts in Saber and use 704 and 4396 random bytes respectively.
The performance impact for the higher-order implementations is much larger. In particular, the relative cost of DecompressedComparison compared to the whole decapsulation increases. This is mainly due to the poor performance of the A2B component for higher orders [SPOG19, Algorithm 3], as opposed to the LUT-based version used at first order: it is both slow and requires most of the random bytes in decapsulation. We expect many optimizations are still possible in the higher-order A2B.

Verification
In addition to the hand-written proofs of Non-Interference we employ the verification tool scVerif to mechanically verify the side-channel resilience of the introduced algorithms on assembly level in realistic leakage models [BGG + 21]. The disassembled object files of our implementation are verified to be Stateful (Strong) Non-Interferent in a fine-grained leakage model to ensure resistance against both the Cortex-M0+ device-specific leakage behavior and the residual state in concrete execution. Stateful NI differs from NI in that it mandates the state (e.g., registers and stack) after execution to be independent of secrets and random values, except for specified locations, and thereby facilitates secure composition of masked assembly components.
scVerif performs verification in fine-grained leakage models by augmenting an internal representation of the assembly code with user-supplied explicit leakages (denoted as "leakage model") in the form of leak statements which are considered as internal probes t G in the proof of (Stateful) NI (Definition 1). Verification with scVerif is split into two phases, (1) partial evaluation to lower the assembly representation into a simpler language amenable to (2) verification in the subsequent stage.
In the following, we detail the changes required to adapt scVerif to the Kyber implementation. The leakage model presented in [BGG + 21] is extended for arithmetic and shift instructions as well as branching instructions, which are modeled without leakage. Our leakage model serves as a design aid, containing known and assumed leakage behavior, but comes without profound physical validation as in, e.g., [BGG + 21,MPW21]. Instead, we augment the model whenever a discrepancy between the model and the observed physical leakage is encountered by constructing tests for the concerned instruction as in [PV17]. This allows us to prove the absence of vulnerabilities arising from known leakage behavior. During our hardening phase we adopt the leakage of multiple single-operand instructions, which differs from instructions with multiple operands. The extended formal leakage model developed might be of independent interest and is provided in Listing 1 in Appendix E.

Algorithm 3 Simplified scVerif code to represent table lookups for formal verification.
LUT(Rd, Rn, Rm)
1: val ← (Rn + Rm − baseaddress_L + r_a) ⊕ r_b ;
2: leak lutOperand (opA, opB, Rn, Rm);
3: leak lutMemOperand (opR, val);
4: leak lutTransition (Rd, val);
5: opR ← val; opA ← Rn; opB ← Rd; Rd ← val;
Kyber makes use of constants (e.g., the modulus q) which are stored alongside the program code. We extend scVerif to allow program counter (pc) relative memory accesses to be partially evaluated for the subsequent verification phase. In doing so, we extend the front-end of scVerif to process the assembly .word directive which introduces a constant at some fixed address: pairs of addresses and constants are placed in the memory view ρ for the state p, c, µ, ρ, ec of the partial evaluator presented in [BGG + 21].
Verification of code containing table lookups, e.g., as used in the table-based A2B conversion, poses problems as it cannot be partially evaluated: the secret value (share) used as address or offset in the memory access is a symbol which does not resolve to a concrete table index. We resolve this by patching the code during verification and substituting the respective lookup instruction (e.g., ldr) with a virtual instruction (LUT) defined in the scVerif intermediate language, which exposes the same leakage behavior but expresses the semantics of the lookup in functional form.
Let us give an example of this approach for the implementation of the first-order A2B as explained in Section 4.1. In assembly the lookup in L is implemented as ldr Rd, [Rn, Rm], where Rd is the destination register, while Rn and Rm contain the base address of L and an offset. The accessed table index defined by Rn and Rm cannot be partially evaluated since it is secret dependent. To allow partial evaluation the instruction is substituted by the virtual instruction LUT implemented in the scVerif intermediate language depicted in Algorithm 3. The global variables r a and r b are annotated to contain uniformly distributed random masks, specific to the masked table. Line 1 is a leak-free assignment which allows one to convert and re-mask the sensitive value, satisfying the table's functionality L (a) = (a + r a mod q) ⊕ r b . The explicit leaks ensure that the side-channel behavior of the substituted ldr is equivalently modeled. Using this approach we verify our table-based A2B conversion to be Stateful Strong Non-Interferent.
In the subsequent verification of our Compress q (., 1) and DecompressedComparison implementations we replace calls (branches) to A2B and random number generators by simplified variants implemented in the scVerif intermediate language, exposing a worst-case leakage assumption that leaks a combination of all registers. This allows us to harden our implementation with respect to different implementations of random number generators and A2B implementations.
Given these prerequisites, the security of Compress_q (., 1) and DecompressedComparison is verified. The large size of our DecompressedComparison implementation forces us to reduce its parameters (i.e., k = 1 and n = 64) for verification, while Compress_q (., 1) can be verified for the Kyber768 parameters. Both components take nine minutes each to verify successfully, establishing that both implementations are Stateful Strong Non-Interferent in our fine-grained leakage model.

Leakage Assessment
Finally, we evaluate the practical side-channel resilience of our hardened first-order implementation by performing statistical leakage detection on physical side-channel measurements. We use the KL82Z development board with capacitors C31, C39, C43, C45, C46, C59 and C61 de-soldered and an inductive current clamp connected on Jumper J15. A PicoScope 6404C oscilloscope samples the power consumption at 312.5 MS/s, with a bandwidth of 500 MHz and 8-bit quantization. The micro-controller is clocked at 12 MHz, resulting in slightly more than 26 samples per clock cycle.
For leakage detection, we rely on the widely-used t-test-based Test Vector Leakage Assessment (TVLA) [GGJR + 11] comparing fixed with random inputs. In particular, a Welch t-test comparing the fixed and random measurements is computed, and the resulting t-value is compared to a set threshold of ±4.5, representing α = 0.00001. Informally, if the threshold is exceeded, it is assumed to be possible to distinguish between fixed and random inputs, which indicates the existence of exploitable leakage. We refer the reader to [CDG + 13, SM16] for more details on TVLA. However, Ding et al. have shown in [DZD + 17] that this threshold needs to be adapted for very long traces to avoid false positives during leakage detection. As our measurements indeed consist of numerous sample points, we adopt their approach to set the threshold for our leakage assessments and thereby avoid erroneous results.
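The per-sample-point statistic used by TVLA can be sketched as follows. This is a minimal, illustrative Welch t-test over one sample point; the function name and inputs are assumptions for the sketch, and a real assessment computes this statistic for every sample point of the trace sets.

```python
import math
from statistics import mean, variance

def welch_t(fixed, rand):
    """Welch's t-statistic between a fixed-input and a random-input
    set of power measurements at a single sample point."""
    mf, mr = mean(fixed), mean(rand)
    vf, vr = variance(fixed), variance(rand)  # sample variances
    return (mf - mr) / math.sqrt(vf / len(fixed) + vr / len(rand))
```

If |t| exceeds the chosen threshold at any sample point, the fixed and random trace populations are considered distinguishable.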
We measure the power consumption of the Cortex-M0+ processor executing 50 000 invocations of the algorithms on a fixed secret value which is freshly masked for each execution, and another set of 50 000 invocations on uniformly distributed secret values. In both cases, the implementation is provided with fresh, pre-sampled randomness stored in a table. In line with [OSPG18] and [BDK + 20], we instantiate the measured module DecompressedComparison with reduced parameter sets, i.e., k = 1 and n = 64, to mitigate its large size while ensuring that the entire function can be assessed. The parameters are chosen such that loops are executed for at least two iterations. We choose the public compressed values in such a way that the invocations with the fixed value compare correctly to all but the last compressed coefficient, whereas the invocations on random (uncompressed) coefficients result in an invalid comparison to the fixed compressed coefficients with high probability. Only by comparing uncompressed to compressed coefficients that match in the fixed invocation can we assess the secrecy of all intermediate comparisons and the handling of the resulting flag.
Compress q (., 1) is assessed without reducing parameters (i.e., n = 256). For DecompressedComparison, the measurements consist of 1,782,438 sample points, for which we set the threshold to 6.89 as described in [DZD + 17]. For Compress q (., 1), we need to process 1,726,452 sample points, and therefore set the threshold to 6.88.
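The adapted threshold can be reproduced with a short sketch, assuming the Šidák-style correction of the significance level for the number of simultaneous per-point tests described in [DZD + 17]; the function name and default α are assumptions of this sketch.

```python
from statistics import NormalDist

def tvla_threshold(n_samples, alpha=1e-5):
    """Per-sample-point t-threshold following Ding et al. [DZD+17]:
    correct the overall significance level alpha for n_samples
    simultaneous tests, then take the two-sided Gaussian quantile
    (a valid approximation for large trace counts)."""
    alpha_sp = 1 - (1 - alpha) ** (1 / n_samples)   # Sidak correction
    return NormalDist().inv_cdf(1 - alpha_sp / 2)   # two-sided quantile
```

For 1,782,438 sample points this yields a threshold of about 6.89, and for a single test it falls back to roughly the classical ±4.5.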
For a first-order secure implementation, the assessment is expected to show no significant leakage at first order, while exceeding the threshold at second order. To validate our setup, we first run the test with the randomness source turned off. In this case, the thresholds are exceeded after just 1000 traces, as depicted in Figures 6(a) and 6(d). The visible sawtooth pattern in Figure 6(a) corresponds to the bitsliced comparison of 32 coefficients in parallel.
In normal operation (i.e., randomness source turned on), our hardened algorithms do not exhibit significant first-order leakage with 100 000 measurements, as can be seen in Figures 6(b) and 6(e). On the other hand, significant univariate second-order leakage is detectable, as depicted in Figures 6(c) and 6(f), indicating that second-order attacks are likely to succeed. To increase SCA resilience beyond the first order provided by our first-order masked implementation, additional countermeasures are required, e.g., increasing the masking order. Our presented higher-order masked algorithms enable implementing Kyber at arbitrary orders, allowing protection against higher-order SCA attacks.

Conclusion
In this work, we presented the first masking scheme for a complete Kyber decapsulation, at both first and higher orders. This is achieved by combining known techniques with two new approaches to mask one-bit compression and decompressed comparison, respectively. We prove both algorithms to be t-SNI and show how to compose them to create a masked Kyber.CCAKEM.Dec. We implement our proposed masking scheme on an Arm Cortex-M0+ and Cortex-M4F at orders one to three. For first-order masked Kyber, this results in overhead factors of 3.5, 3.3 and 2.7 compared to unmasked Kyber.CCAKEM.Dec, Kyber.CPAPKE.Enc, and Kyber.CPAPKE.Dec, respectively, on the Cortex-M4F. We explicitly hardened the first-order implementations of our new algorithms on the Cortex-M0+. Their leakage behavior was both formally and practically verified using scVerif and TVLA with 100 000 measurements. Neither approach detects leakage in our hardened modules of Compress q (., 1) and DecompressedComparison.

A Supporting Material: Proof Theorem 1
• G 13 (NI): The t G13 internal probes and o G13 output shares of the gadget can be simulated with t G13 + o G13 shares of x (·) B 11 and of the output of G 12 .
• G 12 (SNI): The t G12 internal probes and o G12 output shares of the gadget can be simulated with t G12 shares of the output G 10 and G 11 .
• G 11 (NI): The t G11 internal probes and o G11 output shares of the gadget can be simulated with t G11 + o G11 shares of x (·) B 11 .
• G 10 (SNI): The t G10 internal probes and o G10 output shares of the gadget can be simulated with t G10 shares of x (·) B 10 and of the output of G 9 .
• G 9 (SNI): The t G9 internal probes and o G9 output shares of the gadget can be simulated with t G9 shares of x and of the output of G 8 .
• G 8 (SNI): The t G8 internal probes and o G8 output shares of the gadget can be simulated with t G8 shares of the output of G 7 .
• G 7 (NI): The t G7 internal probes and o G7 output shares of the gadget can be simulated with t G7 + o G7 shares of its inputs and of the output of G 6 .
• G 6 (SNI): The t G6 internal probes and o G6 output shares of the gadget can be simulated with t G6 shares of x and of the output of G 5 .
• G 5 (SNI): The t G5 internal probes and o G5 output shares of the gadget can be simulated with t G5 shares of the output of G 4 .
• G 4 (NI): The t G4 internal probes and o G4 output shares of the gadget can be simulated with t G4 + o G4 shares of x (·) B 8 .
• G 3 (NI): The t G3 internal probes and o G3 output shares of the gadget can be simulated with t G3 + o G3 shares of a (·) B .
• G 2 (SNI): The t G2 internal probes and o G2 output shares of the gadget can be simulated with t G2 shares of the output of G 1 .
• G 1 (NI): The t G1 internal probes and o G1 output shares of the gadget can be simulated with t G1 + o G1 shares of the input a (·) A .

B Supporting Material: Proof Theorem 2
• G 25 (SNI): The t G25 internal probes and o G25 output shares of the gadget can be simulated with t G25 shares of the output of G 23 and G 24 .
• G 24 (NI): The t G24 internal probes and o G24 output shares of the gadget can be simulated with t G24 + o G24 shares of the output of G 22 .
• G 23 (NI): The t G23 internal probes and o G23 output shares of the gadget can be simulated with t G23 + o G23 shares of the output of G 22 .
• G 22 (SNI): The t G22 internal probes and o G22 output shares of the gadget can be simulated with t G22 shares of the output of G 20 and G 21 .
• G 21 (NI): The t G21 internal probes and o G21 output shares of the gadget can be simulated with t G21 + o G21 shares of the output of G 19 .
• G 20 (NI): The t G20 internal probes and o G20 output shares of the gadget can be simulated with t G20 + o G20 shares of the output of G 19 .
• G 19 (SNI): The t G19 internal probes and o G19 output shares of the gadget can be simulated with t G19 shares of the output of G 17 and G 18 .
• G 18 (SNI): The t G18 internal probes and o G18 output shares of the gadget can be simulated with t G18 shares of the output of G 15 and G 16 .
• G 17 (SNI): The t G17 internal probes and o G17 output shares of the gadget can be simulated with t G17 shares of the output of G 13 and G 14 .
• G 16 (NI): The t G16 internal probes and o G16 output shares of the gadget can be simulated with t G16 + o G16 shares of the output of G 12 .
• G 15 (NI): The t G15 internal probes and o G15 output shares of the gadget can be simulated with t G15 + o G15 shares of the output of G 11 .
• G 14 (NI): The t G14 internal probes and o G14 output shares of the gadget can be simulated with t G14 + o G14 shares of the output of G 10 .
• G 13 (NI): The t G13 internal probes and o G13 output shares of the gadget can be simulated with t G13 + o G13 shares of the output of G 9 .
• G 12 (SNI): The t G12 internal probes and o G12 output shares of the gadget can be simulated with t G12 shares of the output of G 8 .
• G 11 (SNI): The t G11 internal probes and o G11 output shares of the gadget can be simulated with t G11 shares of the output of G 7 .
• G 10 (SNI): The t G10 internal probes and o G10 output shares of the gadget can be simulated with t G10 shares of the output of G 6 .
• G 9 (SNI): The t G9 internal probes and o G9 output shares of the gadget can be simulated with t G9 shares of the output of G 5 .
• G 8 (NI): The t G8 internal probes and o G8 output shares of the gadget can be simulated with t G8 + o G8 shares of the output of G 4 .
• G 7 (NI): The t G7 internal probes and o G7 output shares of the gadget can be simulated with t G7 + o G7 shares of the output of G 3 .
• G 6 (NI): The t G6 internal probes and o G6 output shares of the gadget can be simulated with t G6 + o G6 shares of the output of G 2 .
• G 5 (NI): The t G5 internal probes and o G5 output shares of the gadget can be simulated with t G5 + o G5 shares of the output of G 1 .
• G 4 (NI): The t G4 internal probes and o G4 output shares of the gadget can be simulated with t G4 + o G4 shares of the input v (·) A .
• G 3 (NI): The t G3 internal probes and o G3 output shares of the gadget can be simulated with t G3 + o G3 shares of the input v (·) A .
• G 2 (NI): The t G2 internal probes and o G2 output shares of the gadget can be simulated with t G2 + o G2 shares of the input u (·) A .
• G 1 (NI): The t G1 internal probes and o G1 output shares of the gadget can be simulated with t G1 + o G1 shares of the input u (·) A .

Algorithm 4 Sampling from CBD η : B 64η → R 3329 as used in Kyber.

C Supporting Material: Proof Theorem 3
• G 7 (SNI): The t G7 internal probes and o G7 output shares of the gadget can be simulated with t G7 shares of the output of G 5 .
• G 6 (NI): The t G6 internal probes and o G6 output shares of the gadget can be simulated with t G6 + o G6 shares of the output of G 2 .
• G 5 (NI): The t G5 internal probes and o G5 output shares of the gadget can be simulated with t G5 + o G5 shares of the output of G 3 and G 4 .
• G 4 (SNI): The t G4 internal probes and o G4 output shares of the gadget can be simulated with t G4 shares of the output of G 6 .
• G 3 (SNI): The t G3 internal probes and o G3 output shares of the gadget can be simulated with t G3 shares of the output of G 2 .
• G 2 (SNI): The t G2 internal probes and o G2 output shares of the gadget can be simulated with t G2 shares of the output of G 1 .
• G 1 (NI): The t G1 internal probes and o G1 output shares of the gadget can be simulated with t G1 + o G1 shares of the input ŝ (·) A .

D Kyber Round 3 Tables & Algorithms

E Leakage Models used during verification with scVerif
We provide the formal leakage model which was used to verify the security of our masked assembly implementations of Algorithms 1 and 2 for the Cortex-M0+ processor. Our leakage model is based on the model presented in [BGG + 21], which merely served as a design aid. It was extended with assumed leakage behavior as well as observed effects, but comes without thorough physical validation. Every macro defines the leakage behavior of an instruction in the domain-specific language "IL" (refer to [BGG + 21] for a detailed description), e.g., ands2_leak models the leakage of the Arm assembly instruction ands with two operands of 32 bits. The semantics of the instructions are provided in the supplementary material of [BGG + 21]. The model includes the virtual lut instruction which specifies the leakage model and semantics of the table for A2B conversion (Section 4.2).
Listing 1: Fine-grained side-channel leakage model used during verification of concrete assembly implementations.