Higher-Order Masked Ciphertext Comparison for Lattice-Based Cryptography

Abstract. Checking the equality of two arrays is a crucial building block of the Fujisaki-Okamoto transformation, and as such it is used in several post-quantum key encapsulation mechanisms, including Kyber and Saber. While this comparison operation is easy to perform in a black-box setting, it is hard to efficiently protect against side-channel attacks. For instance, the hash-based method by Oder et al. is limited to first-order masking, a higher-order method by Bache et al. was shown to be flawed, and a very recent higher-order technique by Bos et al. suffers in runtime. In this paper, we first demonstrate that the hash-based approach, and likely many similar first-order techniques, succumb to a relatively simple side-channel collision attack. We can successfully recover a Kyber512 key using just 6000 traces. While this does not break the security claims, it does show the need for efficient higher-order methods. We then present a new higher-order masked comparison algorithm based on the (insecure) higher-order method of Bache et al. Our new method is 4.2x, resp. 7.5x, faster than the method of Bos et al. for a 2nd-, resp. 3rd-, order masking on the ARM Cortex-M4, and unlike the method of Bache et al., the new technique takes ciphertext compression into account. We prove correctness, security, and masking security in detail, and provide performance numbers for 2nd- and 3rd-order implementations. Finally, we verify the side-channel security of our implementation using the test vector leakage assessment (TVLA) methodology.

orders and provide a security proof, a correctness proof, as well as a masking security proof. Finally, we practically verify our results by providing the TVLA results of our implementation of the first- and second-order comparison using the state-of-the-art 'Fixed + Noise' versus Random (FNvR) framework from [BDH+21].
Outline. In Section 2, we recall Kyber and Saber, the Fujisaki-Okamoto transform, and previous proposals for masking the comparison operation. Then, in Section 3, we describe our attack on the first-order hash-based comparison method. We introduce our new masking approach and prove its security in Section 4. We then analyze its performance and verify the absence of leakage in Section 5. Finally, we conclude in Section 6.

Preliminaries
We now introduce some background on our target schemes Kyber and Saber, on the Fujisaki-Okamoto transform, which necessitates the comparison operation, and on previous proposals aiming to secure said comparison.

Notation
We denote with ⌊x⌋ flooring a number x ∈ R to the closest lower integer and with ⌈x⌋ rounding x to the nearest integer, with ties rounded upwards. Furthermore, we write y = ⌈x⌋_(q→p) to denote y = ⌈(p/q) · x⌋ for an input x ∈ Z_q and output y ∈ Z_p. These operations are extended coefficient-wise for vectors and polynomials. For x, q ∈ Z we write x mod q to denote the integer x̄ ∈ (−q/2, q/2] so that x̄ ≡ x mod q. For a vector or polynomial x we denote with x_i taking the i-th coefficient of x, which is sometimes made more explicit as x[i]. We denote with x ←$ χ sampling x randomly according to the distribution χ, and with x ←r χ sampling pseudorandomly based on the seed r. Let U(I) denote the uniform distribution over a set I. A variable A that is masked into S shares is denoted in bold A. When needed for clarity, we distinguish between a Boolean-masked variable A_B and an arithmetic-masked variable A_A. We denote the j-th share with A^(j).

Saber and Kyber
The comparison technique presented in this paper is broadly applicable for masking implementations where two vectors need to be checked for equality. This can for example be necessary within the equality check of the FO transformation (see Section 2.3) of lattice-based Key Encapsulation Mechanisms (KEMs). In this paper, we will specifically target the use-case of masking implementations of IND-CCA secure KEMs Saber and Kyber.
Algorithms 1 to 3 give a generalized and simplified overview of the workings of the IND-CPA secure encryption schemes Kyber and Saber. These can then be compiled into an IND-CCA secure KEM using the FO transformation, as explained in Section 2.3. Both Kyber and Saber work with vectors of ring elements in R_q^k, where R_q = Z_q[X]/(X^n + 1), with n = 256, and with k an integer between 2 and 4 depending on the security level. The distributions χ(R_q^(k×1)) generate vectors of polynomials with coefficients following a small binomial distribution. The modulus q = q_2 for Kyber is chosen as a prime, and the compression moduli p and T are powers of two. In Saber, all moduli q, q_2, p, T are powers of two, with q > q_2 = p > T. For a more in-depth discussion of Saber and Kyber, we refer to [DKRV18] and [BDK+18].
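As a concrete illustration of the rounding-based compression ⌈x⌋_(q→p), the following Python sketch (with illustrative moduli; this is not the reference code of either scheme) shows compression, decompression, and the resulting per-coefficient error bound:

```python
# Toy sketch of coefficient compression in Kyber-like schemes: a
# coefficient x in Z_q is mapped to Z_p via y = round((p/q)*x) mod p,
# with ties rounded upwards.  Moduli here are illustrative.
q = 3329   # Kyber's prime modulus
p = 2**10  # an example power-of-two compression modulus

def compress(x: int, q: int, p: int) -> int:
    # round(p*x/q) with ties rounded upwards, then reduce mod p
    return ((2 * p * x + q) // (2 * q)) % p

def decompress(y: int, q: int, p: int) -> int:
    # approximate inverse: round(q*y/p)
    return (2 * q * y + p) // (2 * p)

# The round trip introduces an error of at most about q/(2p) per coefficient.
x = 1234
y = compress(x, q, p)
err = min((decompress(y, q, p) - x) % q, (x - decompress(y, q, p)) % q)
assert err <= q // (2 * p) + 1
```

The round-trip error of roughly q/(2p) per coefficient is the compression noise that reappears as Δu and Δv in the attack analysis of Section 3.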

FO-transformation
The FO transformation [FO99, HHK17] is a generic method to convert an IND-CPA secure encryption scheme into an IND-CCA secure KEM. In the FO transformation, the encapsulation is a deterministic version of the encryption, where all randomness is derived pseudorandomly from the message (which is itself chosen at random). This means that one can exactly recompute the ciphertext from the message. The high-level idea is that during decapsulation, the message is decrypted and then used to check if the input ciphertext is well-formed. This check is performed by re-encrypting the message and checking whether the input ciphertext equals the re-encryption result. An overview of the FO transformation is given in Algorithms 4 to 6, where G and H represent hash functions and where KDF is a key derivation function.
When the input ciphertext is invalid, the decapsulation will return unusable randomness so that an adversary does not gain information (except for the fact that the ciphertext was rejected). The security of the FO transformation relies on the fact that in this case, an attacker does not learn anything about the intermediate values of the computation. Protection of these routines against side-channel attacks is, therefore, of utmost importance.
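The control flow of the transform can be sketched as follows; the toy "encryption scheme" here is a deliberately insecure stand-in (all function names are illustrative, not the Kyber/Saber API), and only the decrypt, re-encrypt, compare, implicit-reject structure mirrors the text:

```python
# Hedged toy sketch of FO decapsulation.  The "scheme" is a PRF-based
# one-time pad whose coins travel with the ciphertext (insecure, but
# deterministic, which is all the sketch needs).
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha3_256(b"|".join(parts)).digest()

def toy_enc(pk: bytes, m: bytes, coins: bytes) -> bytes:
    pad = H(b"prf", pk, coins)
    return coins + bytes(a ^ b for a, b in zip(m, pad))

def toy_dec(sk: bytes, ct: bytes) -> bytes:
    coins, body = ct[:32], ct[32:]
    pad = H(b"prf", sk, coins)          # toy scheme: sk == pk
    return bytes(a ^ b for a, b in zip(body, pad))

def encaps(pk: bytes, m: bytes):
    coins = H(b"G", m)                  # all randomness derived from m
    return toy_enc(pk, m, coins), H(b"KDF", m)

def decaps(sk: bytes, pk: bytes, ct: bytes, z: bytes) -> bytes:
    m2 = toy_dec(sk, ct)
    ct2 = toy_enc(pk, m2, H(b"G", m2))  # deterministic re-encryption
    if ct2 == ct:                       # the comparison this paper masks
        return H(b"KDF", m2)
    return H(b"KDF", z, ct)             # implicit rejection: unusable key

key = b"k" * 32
ct, K = encaps(key, b"m" * 32)
assert decaps(key, key, ct, b"z" * 32) == K          # valid ciphertext
bad = ct[:-1] + bytes([ct[-1] ^ 1])
assert decaps(key, key, bad, b"z" * 32) != K         # tampered -> rejected
```

The `ct2 == ct` comparison on the re-encrypted ciphertext is exactly the operation that the remainder of this paper protects against side-channel attacks.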

Hash-Based Method [OSPG18]
In [OSPG18], an efficient method for constructing a first-order masked ciphertext comparison is presented. While the original method has a subtle flaw [BDH + 21], there exists an easy fix, and the general method was used in several secured implementations of e.g., Saber and Kyber [FVBR + 21, VBDK + 21].
Assume that the re-encryption outputs its result c′ in two Boolean shares (c′^(0), c′^(1)), i.e., c′ = c′^(0) ⊕ c′^(1). To test if c is equal to c′, [OSPG18] propose to compute

H(c ⊕ c′^(0)) ?= H(c′^(1)). (1)

Only if c = c′ do the two hash function calls receive the same input and the hashes match. Due to the random sharing of c′, the actual inputs are randomized for each decapsulation. A drawback of this method is that it only works for first-order maskings.

Random Sum Method [BPO+20]
In this paper, we will generalize the random sum method of [BPO+20] to both prime and power-of-two moduli, and to schemes that undergo ciphertext compression. The core idea of the random sum method is to test if a list of input coefficients are all zero: given n masked coefficients in S shares D_0, ..., D_(n−1) ∈ Z_q^S, we want to calculate, for every i, whether the sensitive unmasked sum Σ_j D_i^(j) is zero. Instead of performing the zero check for all coefficients individually, we compress them into one term. However, to avoid giving away sensitive information, we first do this separately for each share, by calculating

E^(j) = Σ_(i=0)^(n−1) R_i · D_i^(j)

for each share¹. In a second phase one then checks if Σ_j E^(j) = 0. On the one hand, if every unmasked coefficient Σ_j D_i^(j) = 0, then clearly Σ_j E^(j) = 0. On the other hand, if at least one of the coefficients is not zero, the term Σ_j E^(j) will only be zero with limited probability (depending on the random R_i's).
For a prime q, one can show that the probability of such a false positive is equal to the probability that a random element of Z_q is zero, which is 1/q. For a power-of-two q, an adversary can choose the input coefficients as q/2 or 0 to increase this false positive probability to 1/2. As in both cases this straightforward approach does not give enough certainty for typical cryptographic applications, it is proposed in [BPO+20] to replicate the check L times, limiting the false positive probability to 1/q^L (or 1/2^L for power-of-two q). In [BPO+20] this is done by dividing the coefficients into subsets and performing the check on each subset individually.
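A minimal sketch of the (simplified) check, together with the power-of-two false-positive issue described above (toy parameters, not a masked implementation):

```python
# Sketch of the simplified random sum check: per share j, compute
# E^(j) = sum_i R_i * D_i^(j) mod q, then test sum_j E^(j) == 0 mod q.
# We also reproduce the power-of-two false-positive issue.
import random

def random_sum_check(D, q, rng):
    # D[i][j]: share j of coefficient i; all arithmetic mod q
    n, S = len(D), len(D[0])
    R = [rng.randrange(q) for _ in range(n)]
    E = [sum(R[i] * D[i][j] for i in range(n)) % q for j in range(S)]
    return sum(E) % q == 0

rng = random.Random(1)
q, n, S = 2**13, 8, 3

def share(x):
    r = [rng.randrange(q) for _ in range(S - 1)]
    return r + [(x - sum(r)) % q]

# all-zero coefficients always pass
assert all(random_sum_check([share(0) for _ in range(n)], q, rng)
           for _ in range(100))

# adversarial q/2 coefficients pass with probability ~1/2 for power-of-two q
D_bad = [share(q // 2)] + [share(0) for _ in range(n - 1)]
hits = sum(random_sum_check(D_bad, q, rng) for _ in range(2000))
assert 850 < hits < 1150
```

The second assertion makes the 1/2 false-positive rate for a q/2 coefficient and power-of-two q directly visible.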
In [BDH+21] two vulnerabilities in the above repetition method were shown: Firstly, leakage of intermediate check results leads to a first-order side-channel attack and must therefore be avoided. Secondly, performing the intermediate checks on only a subset of the coefficients leads to a chosen-ciphertext attack in which the adversary inputs a slightly adapted ciphertext that only fails in one of the L intermediate checks, increasing the false positive probability to 1/q (or 1/2 for power-of-two q). The authors concluded that a random sum algorithm must not leak any results of the intermediate checks and that it must always be calculated over all coefficients.
¹ Bhasin et al. [BDH+21] showed that one does not need the second randomness R_1i. For the sake of clarity, we discuss their simplified method in this paper.
In this paper, we construct a random sum-based technique that deals with the problems in the original technique by moving to a large q. This reduces the false-positive probability, which allows relying on only one check (L = 1). More information about our technique is given in Section 4.
Side-Channel Collision Attack on a First-Order Implementation
In this section, we present a side-channel attack that can easily break the first-order hash-based comparison method of [OSPG18]. Our attack belongs to the category of horizontal collision attacks [MME10], i.e., side-channel leakage is used to test if the same data is processed at two selected points during the computation of H(c ⊕ c′^(0)) ?= H(c′^(1)). As this can be considered a second-order attack, no security claims of the masking approach are broken. Still, the attack is unprofiled, requires only minimal knowledge of the implementation, is likely robust in terms of noise, and is for these reasons easy to perform. Hence, it shows that the hash-based first-order approach, and likely also other first-order approaches using Boolean masking, do not offer sufficient protection.

Attack Description
We use Kyber for all our explanations and experiments, but we note that the hash-based comparison approach, and hence our attack, is also applicable to Saber. Our attack follows along the lines of the generic side-channel attack using a decryption failure oracle described in [BDH+21]. That is, we honestly generate a ciphertext and then manipulate a single coefficient of the second ciphertext component c_2 (corresponding to v) while keeping the first component c_1 (corresponding to u) untouched. Depending on the concrete value of the secret key, this manipulation can lead to a decryption failure, i.e., the recovered message m′ can differ from the m used during encapsulation in a single bit.

Since the random coins r′ used for re-encryption are derived by hashing m′, this single-bit error leads to entirely different values being used during re-encryption and thus a ciphertext c′ bearing no resemblance to c. If, however, no decryption failure occurs, then c and c′ differ only in a couple of bits (one coefficient of c_2). While these two scenarios are not discernible in a black-box setting (both lead to the ciphertext being rejected), they can be distinguished using side-channel measurements, which in turn gives information on the secret key.
Concretely, when back-substituting decryption we get

v − s^T u = ⌈q/2⌋ · m + e^T r − s^T (e_1 + Δu) + e_2 + Δv, (2)

with Δu and Δv denoting the errors introduced by compression of the two ciphertext components. We call

d = e^T r − s^T (e_1 + Δu) + e_2 + Δv (3)

the decryption noise. The parameters of Kyber are chosen such that ||d||_∞ < q/4 with very high probability, which ensures that no decryption errors occur. Since all elements of d are in some form small and the range is limited, d can be lifted to Z.
For the attack, we honestly generate a ciphertext and then add q/4 to one selected coefficient of v. The corresponding message bit at index i will still be correctly decoded only if d_i + q/4 < q/4, i.e., if d_i < 0. If, however, d_i ≥ 0, then a decryption error will occur.

Detecting decryption errors. We use the following method to detect decryption errors in the first-order hash-based masked comparison. As explained in Section 2.4.1, during the comparison the two masked ciphertext components are hashed and their equality is checked as H(c ⊕ c′^(0)) ?= H(c′^(1)). First, we note that for Kyber, c is at least 768 bytes large, which is larger than the block size of the hash function. For example, when instantiating H with SHAKE128, which has a block size of 1344 bits, c is split into at least 5 blocks. Moreover, the ciphertext consists of two parts c = (c_1 || c_2). Since c_1 is typically larger than c_2, c_2 is only part of the last input blocks (in our case, the last two), while all previous blocks contain only c_1.
We can now differentiate based on whether a decryption error occurs or not. Remember that in case of a decryption error, c′ is essentially independent of c, as it is generated using entirely different coins r′. In this case, the inputs to H(c ⊕ c′^(0)) and H(c′^(1)) differ already in the first block. If, however, the correct message is still recovered, then c and c′ differ only in a few bits of c_2 in the last hash blocks. Consequently, the first couple of blocks of H(c ⊕ c′^(0)) and H(c′^(1)) use identical input.
Thus, by using a side-channel collision attack on the first blocks of H(c ⊕ c′^(0)) and H(c′^(1)), e.g., comparing the power consumption and determining whether the segments are similar, one can determine if a decryption failure occurred, and consequently the sign of d_i. This collision attack can use a large portion of the trace (at least 3 full Keccak-f permutations), which is why a single trace is usually sufficient and the noise robustness is high.
Solving for the key. After gathering many traces and extracting the sign of the respective d_i, one needs to extract the key from this information. In [BDH+21], the relation (v − us)_i ≈ m_i · q/2 + d_i is fed to the Leaky-LWE framework of Dachman-Soled et al. [DDGR20]. In this framework, the lattice described by the public-key equation t = As + e is transformed using hints gathered via side channels. We found that this approach is not ideal for the problem at hand: the runtime needed for including the hints in the lattice is quite high, and the security level decreases only slowly with the number of traces. For instance, after gathering 2^17 approximate equations, each requiring multiple measurements, [BDH+21] still report a key-recovery complexity of roughly 2^64 operations for Kyber512.
Recently, [PP21] and [HPP21] presented fault attacks on several CCA-secure lattice-based KEMs, including Kyber. Their attacks can be classified as safe-error attacks [YJ00], in that they inject a specific fault and then observe whether decapsulation still returns the correct result. As it turns out, their key-recovery problem is identical to ours. They propose a different solving approach, which we now briefly describe.
Their approach exploits the composition of d, i.e., the structure of the right-hand side of eq. (3). Note that if the attacker honestly generated the ciphertext by running encapsulation using the public key, then only the secret key (e, s) is unknown; all other values in d are generated or can be computed during encapsulation. Since d_i is small, d_i is thus linear (in Z) in the key coefficients. Since the side-channel attack only extracts the sign of d_i, the equalities turn into inequalities of the form

e^T r − s^T (e_1 + Δu) + e_2 + Δv ≷ 0. (4)

After gathering enough inequalities, they solve for the key using an approach akin to linear decoding. We employ the improved method of [HPP21], which reaches a success rate close to 1 when using roughly 5750, 6750, and 8500 inequalities (and thus measurements) for Kyber512, Kyber768, and Kyber1024, respectively.
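The following toy sketch illustrates the shape of these inequalities, using plain integer vectors instead of ring elements and tiny dimensions (it does not implement the solver of [HPP21]): the true key satisfies every inequality, while a wrong key violates a large fraction of them, which is what linear-decoding-style solvers exploit.

```python
# Toy illustration of eq. (4): the sign of the decryption noise
# d = <e, r> - <s, e1 + du> + e2 + dv is linear in the unknown key (e, s).
# Dimensions and distributions here are illustrative, not Kyber's.
import random

rng = random.Random(7)
n = 32                                        # toy dimension

def small(k):
    return [rng.randrange(-2, 3) for _ in range(k)]

sec_s, sec_e = small(n), small(n)             # the unknown key (s, e)

def new_inequality():
    r, u = small(n), small(n)                 # u plays the role of e1 + du
    c = rng.randrange(-3, 4)                  # c plays the role of e2 + dv
    d = sum(a * b for a, b in zip(sec_e, r)) \
        - sum(a * b for a, b in zip(sec_s, u)) + c
    return (r, u, c), d >= 0                  # attacker learns only sign(d)

ineqs = [new_inequality() for _ in range(500)]

def agreement(cand_e, cand_s):
    # fraction of recorded sign inequalities a candidate key satisfies
    ok = 0
    for (r, u, c), sign in ineqs:
        d = sum(a * b for a, b in zip(cand_e, r)) \
            - sum(a * b for a, b in zip(cand_s, u)) + c
        ok += ((d >= 0) == sign)
    return ok / len(ineqs)

assert agreement(sec_e, sec_s) == 1.0         # true key satisfies all of them
assert agreement([0] * n, [0] * n) < 0.9      # a wrong key violates many
```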

Attack Setup and Measurement
We verified the correctness of our attack by attacking a microcontroller running Kyber. Our target is an STM32F405 (ARM Cortex-M4) mounted on a ChipWhisperer side-channel evaluation board [Newb]. The microcontroller was clocked at 24 MHz to match the frequency used in the popular PQM4 PQC benchmarking framework [KRSS]. We measured the voltage drop over the on-board shunt resistor using a LeCroy AP034 differential probe. Due to the length of the measurements (multiple Keccak-f permutations per trace) and the limited sample memory of the ChipWhisperer Lite platform [Newa], we performed the measurements using an oscilloscope sampling at 100 MS/s. To reduce noise, we used the scope's built-in 20 MHz analog filter. The microcontroller runs the most recent ASM-optimized Kyber512 implementation included in PQM4 [KRSS]. This implementation is unprotected; we added the first-order masked comparison as described above, using the included ASM-optimized SHAKE128 for H. A trigger signal marks the start of the masked comparison, i.e., the computation of the hash functions. We generated the manipulated ciphertexts on a PC (using the known public key) and then sent them to the device for decapsulation.
Trace processing. Each trace contains two invocations of H, namely H(c ⊕ c (0) ) and H(c (1) ). To localize these calls and align them for the horizontal side-channel attack, we compute an autocorrelation over the trace and then select its peak index as the beginning of the second subtrace. This method can also be used to find the hashes if no dedicated trigger signal is available.
We then perform a pointwise subtraction of the two trace segments and compute the mean of the squared difference. If this quantity is above a certain threshold, we conclude that the two hash calls processed vastly different inputs, i.e., that a decryption error occurred. If the score is below the threshold, then the message m was correctly recovered and the first input blocks of the two hashes are identical. This principle is illustrated in Figure 1, which shows collision scores obtained by adding all T − 1 = 15 possible nonzero offsets to one coefficient of c_2 (compressed v).
We dynamically determine the threshold by sending two modified ciphertexts and taking the midpoint between their mean-squared differences as the threshold. The first ciphertext is random; hence a decryption error will occur and the hashes will have different inputs. The second ciphertext is honestly generated, but we add 1 to one coefficient of c_2. It is highly unlikely that this change leads to a decryption error; the hash inputs thus differ only in one bit in one of the later blocks.
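The classification step can be sketched as follows, with a toy Hamming-weight leakage model standing in for real traces (the noise level and segment sizes are illustrative):

```python
# Sketch of the collision-score classifier: simulate two leakage segments,
# subtract pointwise, and compare the mean squared difference against a
# threshold calibrated from one colliding and one non-colliding trace pair.
import random

rng = random.Random(3)

def leak(data, noise=0.2):
    # toy leakage: Hamming weight of each processed byte plus Gaussian noise
    return [bin(b).count("1") + rng.gauss(0, noise) for b in data]

def score(t0, t1):
    return sum((a - b) ** 2 for a, b in zip(t0, t1)) / len(t0)

block = bytes(rng.randrange(256) for _ in range(168))   # one hash input block
other = bytes(rng.randrange(256) for _ in range(168))

same = score(leak(block), leak(block))   # colliding inputs: noise only
diff = score(leak(block), leak(other))   # decryption failure: different data
threshold = (same + diff) / 2            # midpoint calibration as in the text
assert same < threshold < diff
```

With identical inputs the score reflects only measurement noise, while different inputs add the full data-dependent leakage variance, which is why a single trace pair usually suffices.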

Results
We performed a total of 10 experiments, each using a different key. We collected 6000 measurements per experiment, which, according to [HPP21], should suffice for key recovery. All 10 experiments were successful; we only observed 2 misclassified d i throughout all 60 000 measurements.

Higher-Order Masked Comparison
In this section, we describe our new higher-order masked comparison. We start by explaining our method and continue with three security proofs, showing correctness, security, and masking security. Contrary to previous random sum methods, our method works for schemes with either a power-of-two or a prime modulus and thus accommodates both Saber and Kyber.
As explained in Section 2.4.2, the random sum method has a false-positive probability with which the method wrongly concludes that all input coefficients equal zero. In both [BPO+20] and [BDH+21], false positives happen with non-negligible probability, and these works proposed to perform multiple checks to further reduce this probability.
In this work, we want to perform only one check. We achieve this by enlarging the masking modulus q up to a point where the false positive probability is sufficiently small. While previous works only considered a prime modulus as it avoids zero divisors that lead to a higher false positive probability, we prove that in our design it is also possible to use a power-of-two modulus. We specifically choose a large power-of-two modulus as it interplays well with other masking techniques such as Arithmetic to Boolean (A2B) or Boolean to Arithmetic (B2A) mask conversion. However, it would also be possible to choose a large prime as new modulus. We specifically focus our method on lattice-based encryption schemes with compression (e.g., Saber [DKRV18] or Kyber [BDK + 18]) which require additional preprocessing of the inputs.
Algorithm 7 gives an overview of our technique. It relies on three subfunctions: A2B, which transforms an arithmetic sharing into a Boolean sharing; B2A_(p→q), which transforms a p-bit Boolean masking into a q-bit arithmetic masking; and BooleanEqualityTest, which tests if a sharing of a single coefficient E represents zero (i.e., Σ_(j=0)^(S−1) E^(j) = 0).
The algorithm consists of five steps. Steps 1 and 2 convert the input from a small modulus to a larger modulus p · 2^(s−1), where s is a security parameter that will determine the false positive probability (which will equal 2^(−s)). In step 3 we perform the compression as proposed in [BPO+20] and improved in [BDH+21], and step 4 performs a check on only one (large) coefficient.
Step 0 performs application-specific preprocessing. For our case this consists of three parts: First, in the case of lattice-based encryption, we actually need to compare two coefficients (a masked one with a public one) instead of performing a zero check. The addition of the constant 2^(f_bits,B)/2 is used to mimic a rounding operation when shifting right in step 1. By subtracting the public coefficients from the first share of each corresponding masked coefficient (lines 5 and 11), we convert the comparison into a zero check.
Secondly, for non-power-of-two moduli q we scale to a power-of-two modulus to make the conversion to the larger modulus p · 2^(s−1) easier (line 3 and line 9). To avoid introducing errors during rounding, we take into account a sufficient number of fractional bits. In practice, we have to choose f_bits,B and f_bits,C large enough to satisfy the condition of Theorem 1; for example, f_bits,B = f_bits,C = 13 for masked Kyber with two shares, as discussed in [FVBR+21]. This step is skipped for power-of-two schemes, where 2^(f_bits,B) = q/p and 2^(f_bits,C) = p/T. Finally, we need to perform the compression from modulus q to a modulus p or T. This is done in line 14 and line 17, as this operation is trivial in the Boolean domain. Note that previous random sum comparison methods [BPO+20, BDH+21] were not capable of handling the compression of the ciphertext.
As with previous random sum methods, our method is prone to false positives (also named collisions in [BDH + 21]) with a small probability of 2 −s . When such a collision occurs, a non-valid ciphertext is accepted and as shown in [BDH + 21] an adversary can use these collisions to reduce the security of the attacked scheme.
An adversary needs to submit on average 2^s invalid ciphertexts to obtain one collision. In our security proof we will show that this collision probability is independent of the adversary's input, so an adversary cannot increase the probability of triggering a collision. Techniques like failure boosting [DGJ+19], which aim to increase the failure probability, are therefore not applicable in this context. Moreover, multiple collisions would be required to significantly reduce the security of the scheme.
The security parameter s can be chosen arbitrarily depending on the application scenario (at the cost of additional operations and randomness). In our implementation, we choose s = 54 so that the maximal bitwidth of the variables is 64 bits. This corresponds to a collision probability of 2^(−54), or an expected 2^54 queries that an adversary needs to make to obtain one collision. We provide additional implementations with s = 118 (maximal bitwidth 128 bits) and s = 128 (maximal bitwidth 138 bits) in Appendix A.
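The effect of the enlarged modulus can be illustrated numerically; the sketch below scales s down to 10 (rather than 54) purely so the collision probability is observable, and it omits the masking conversions and the compression preprocessing:

```python
# Numerical sketch of the single-check idea: after lifting the coefficient
# differences (each smaller than p) into the large modulus M = p * 2^(s-1),
# one random sum suffices, with false-positive probability 2^-s.
import random

rng = random.Random(5)
p, s = 2**4, 10
M = p * 2**(s - 1)        # large power-of-two masking modulus
n, S = 4, 2               # toy numbers of coefficients and shares

def share(x):
    r = [rng.randrange(M) for _ in range(S - 1)]
    return r + [(x - sum(r)) % M]

def single_check(D):
    R = [rng.randrange(M) for _ in range(n)]
    E = [sum(R[i] * D[i][j] for i in range(n)) % M for j in range(S)]
    return sum(E) % M == 0

# matching ciphertexts (all differences zero) are always accepted
assert all(single_check([share(0) for _ in range(n)]) for _ in range(50))

# the worst-case nonzero difference p/2 is accepted with probability ~2^-s
D_bad = [share(p // 2)] + [share(0) for _ in range(n - 1)]
hits = sum(single_check(D_bad) for _ in range(20000))
assert 1 <= hits <= 60    # about 20000 * 2^-10 ~ 20 hits expected
```

Compared with the sketch in Section 2.4.2, only the modulus changed; the enlarged modulus alone pushes the false-positive probability from 1/2 down to 2^(−s).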

Security Proof
Theorem 1 (Correctness and Security of Algorithm 7). Let p, T be powers of two, let q be a power of two or a prime, and let k, n, s be integers. If q is prime, then we require f_bits,B and f_bits,C to be larger than log_2(S) − log_2(⌈q/2⌉/q − 1/2). Then Algorithm 7 returns 1 if the masked inputs encode B and C, and otherwise returns 1 with probability at most 2^(−s).

Proof. By simple substitution we have:

We continue by focusing on the term Σ_j B_i^(B,(j)), which can be further expanded to:

For power-of-two moduli, where 2^(f_bits,B) = q/p, the inner flooring operates on integers and can therefore be ignored. For non-power-of-two moduli q we use the trick described in [FVBR+21]. In this case we can rewrite the equation as:

where e^(j) is the flooring error, which can be bounded by 0 ≤ e^(j) < 1. Note that we can drop the error term as long as this does not produce an overflow, that is, ⌊y − e + 1/2⌋ = ⌊y + 1/2⌋ as long as e < ((y + 1/2) mod 1). On the one hand, (y + 1/2) mod 1 has only a limited number of possible values, which can be described as the set {(i/q + 1/2) mod 1 | i ∈ [0, q)} (as described in [FVBR+21]). The worst-case scenario is i = ⌈q/2⌉, in which case (y + 1/2) mod 1 = ⌈q/2⌉/q − 1/2. On the other hand, we have a worst-case error value e^(j) = 1, so that e < S/2^(f_bits,B). Thus, to remove the error term we need that e < ((y + 1/2) mod 1), or S/2^(f_bits,B) < ⌈q/2⌉/q − 1/2, or

f_bits,B > log_2(S) − log_2(⌈q/2⌉/q − 1/2). (13)

Since we chose f_bits,B to fulfill this condition, we can remove the error and write:

An analogous derivation is also possible for the C terms. For the sake of convenience, we will denote β_i =

Correctness. When the conditions are met, i.e.,

Security. When the conditions are not met, at least one of the coefficients of B or C does not equal the corresponding masked value of B* or C*. We first look at the case where at least one coefficient of B does not match its corresponding value, and treat the case where only coefficient(s) of C differ later. Without loss of generality, we can assume that a non-corresponding coefficient is located at the zeroth position (i = 0).
The output of Algorithm 7 is binary, and the (incorrect) return value 1 is given when Σ_j E^(j) = 0 mod p · 2^(s−1). We will first derive an equivalent condition in terms of the inputs and then show that, due to the randomness of R_0, it is hard to choose an input that returns 1 when the condition is not fulfilled.
Rewriting Σ_j E^(j) = 0 mod p · 2^(s−1) using Equation 17, we get:

Now we choose z so that 2^z = gcd(p · 2^(s−1), β_0). Notice that 2^z ≤ p/2, as β_0 is strictly smaller than p. Moreover, when an incorrect response 1 is given, Equation 20 is satisfied and thus 2^z divides

We can then define the variables X and Y as follows:

to rewrite the condition (Equation 20) further as:

where in the latter step we use the fact that 1 = gcd(X, p · 2^(s−1−z)) to conclude that X has an inverse. Notice that the modulus of this expression is p · 2^(s−1−z), and we know that 2^z ≤ p/2, from which we can conclude that the modulus satisfies p · 2^(s−1−z) ≥ 2^s and thus R_0 retains its entropy in this expression.
As R 0 is independent of the terms X and Y and is unknown to any adversary, the probability of obtaining j E (j) = 0 mod p · 2 s−1 when the condition is not met is upper-bounded by the guessing probability of R 0 which is 2 −s . A similar reasoning can be constructed if only coefficients of C are incorrect. The only difference is that 2 z is upper-bounded as 2 z ≤ T /2 due to the different modulus. However, as T ≤ p we can state 2 z ≤ p/2. The rest of the proof is analogous to the one outlined above.
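For toy parameters, the counting argument behind this bound can be checked exhaustively:

```python
# Exhaustive check (tiny parameters) of the counting argument above: for a
# nonzero difference beta < p and any fixed offset Y, the number of values
# R in [0, M) with R * beta + Y = 0 mod M, where M = p * 2^(s-1), is at
# most gcd(M, beta) <= p/2, i.e., the false-positive probability is <= 2^-s.
from math import gcd

p, s = 8, 4
M = p * 2**(s - 1)                     # M = 64

for beta in range(1, p):               # beta_0 is nonzero and < p
    for Y in range(M):
        count = sum((R * beta + Y) % M == 0 for R in range(M))
        assert count <= p // 2         # at most gcd(M, beta) solutions
        assert count / M <= 2**-s      # probability bound 2^-s
```

The worst case beta = p/2 attains the bound exactly, matching the 2^z ≤ p/2 step of the proof.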

Theorem 2 (t-(S)NI of Algorithm 7). Let BooleanEqualityTest be a t-NI gadget and let A2B and B2A be t-SNI gadgets. Let B* and C* be the masked inputs of Algorithm 7. For any set of t_c ≤ t intermediate variables, there exist subsets I_B and I_C of input indices with |I_B| + |I_C| ≤ t_c such that the t_c intermediate values can be perfectly simulated from the input variables B^(*,(I_B)) and C^(*,(I_C)). This t-NI security also implies t-SNI security, as there is no sensitive output value.
Proof. We divide Algorithm 7 into five types of gadgets, representing the five steps in the algorithm. The first three types of gadgets, G_0 to G_2, work on the coefficients individually, outputting variables with S shares. Gadget G_3 combines all coefficients, but works on each share individually (different shares are indicated with different arrow types in Figure 2).
Each gadget is subdivided into separate subgadgets that perform the same operation on different input data. An overview is given in Figure 2. Gadget G_3 of step 3 consists of the subgadgets G_3,Ej, covering lines 27-29 (for one specific j), and gadget G_4 of step 4 covers line 32. Starting with the last gadget, we will work our way back through the algorithm. We denote the number of intermediate values probed by the adversary in each gadget with t_0 to t_4, for G_0 to G_4 respectively. The adversary can probe at most t_c = t_0 + t_1 + t_2 + t_3 + t_4 ≤ t intermediate variables.
Gadget G_4. Gadget G_4 is by definition t-NI. As it has no sensitive output variables, output probes are not relevant. By definition of t-NI, the t_4 intermediate variables can be simulated with at most t_4 input values.
Gadget G_3. There are S instantiations of gadget G_3, each linked to a specific share number j of the output values of G_2. Each gadget G_3,Ej outputs one variable by iteratively calculating

E^(j) = Σ_i R_i · D_i^(j),

with R_i the random values. One can simulate the output value of gadget G_3,Ej using all inputs D_i^(j) corresponding to the share number j of the gadget. As such, to simulate the distribution of the probes in G_3 and G_4, one needs to probe at most t_4 + t_3 ≤ t of the shares of each output of G_2, and thus at most t_4 + t_3 ≤ t of the output variables of each gadget.
Gadget G_2. The gadgets G_2 are t-SNI secure by definition. From above we know that each gadget is output-probed in at most t_4 + t_3 output variables. There are t_2 intermediate probes divided over the different gadgets G_2,Bi and G_2,Ci, the numbers of which we denote with t_2,Bi and t_2,Ci respectively. We have that t_2 = Σ_i (t_2,Bi + t_2,Ci). In the following we focus on the gadgets G_2,Bi, but the reasoning can be trivially adapted for G_2,Ci.

As each gadget is t-SNI secure, we know that any O output variables and t_2,Bi intermediate variables can be simulated by at most t_2,Bi input variables, if O + t_2,Bi ≤ t. From above we know that O ≤ t_3 + t_4 and t_2,Bi ≤ t_2, and thus O + t_2,Bi ≤ t. This means that each gadget can be simulated with at most t_2,Bi input probes, and thus G_2 can be simulated with at most t_2 input probes.

Gadget G_1. The gadgets G_1 are also t-SNI secure by definition. Therefore, with a similar reasoning as for gadget G_2, for at most t_2 output probes and t_1 intermediate probes, gadget G_1 can be simulated with at most t_1 input probes.

Gadget G_0. Each output variable can be trivially simulated using the corresponding input variable, as the gadgets in G_0 are simple operations on one share. The intermediate values of the gadgets G_0 can also be simulated by the corresponding input variable. As such, we have |I_B| + |I_C| ≤ O + t_0 ≤ t_1 + t_0 ≤ t, which proves our theorem.

Evaluation
After describing our new approach in depth, we now analyze its performance. We also verify its soundness in practice using side-channel measurements.

Subroutines
Our masked comparison algorithm requires subroutines for A2B and B2A mask conversion, as well as BooleanEqualityTest. There are many ways to instantiate these subroutines. In this section, we detail our choices, aimed at maximum performance.
For A2B conversion, we employ the method of [CGV14, Algorithm 4]. As in [SPOG19], we replace the function Expand [CGV14, Algorithm 5] with its t-SNI variant RefreshXOR [BBE+18, Algorithm 8]. The core building block of this specific A2B conversion is a Boolean-masked binary adder. In such an adder, every call to an XOR (for the sum) or AND (for the carry) is replaced by its masked variant, SecXOR or SecAND. This is trivial for XOR in a Boolean masking, but requires special treatment for SecAND. The Boolean-masked binary adder requires bit-level manipulations, quickly leading to a large computational blow-up depending on the size of the input (e.g., p · 2^(f_bits,B) for B_A). [CGV14] describe a method to avoid bit-level manipulations while otherwise keeping the same complexity. However, we can exploit the fact that we need many conversions in parallel by bit-slicing the implementation. On the 32-bit Cortex-M4, our A2B implementation uses bit-slicing to compute 32 conversions in parallel.
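The following sketch shows the kind of Boolean-masked adder meant here, with an ISW-style SecAND and a word-level carry-propagation loop; the actual implementation is bit-sliced and follows [CGV14]/[BBE+18], which this toy Python version does not reproduce:

```python
# Toy Boolean-masked ripple-carry adder: XOR is share-wise, AND uses the
# ISW multiplication gadget.  Correctness only; no bit-slicing, no refresh.
import random
from functools import reduce
from operator import xor

rng = random.Random(9)
S, W = 3, 16                          # number of shares, word width
MASK = (1 << W) - 1

def bshare(v):
    sh = [rng.randrange(1 << W) for _ in range(S - 1)]
    return sh + [v ^ reduce(xor, sh, 0)]

def unmask(x):
    return reduce(xor, x, 0)

def sec_xor(x, y):
    return [a ^ b for a, b in zip(x, y)]

def sec_and(x, y):
    # ISW multiplication gadget, applied bitwise to W-bit words
    z = [x[i] & y[i] for i in range(S)]
    for i in range(S):
        for j in range(i + 1, S):
            r = rng.randrange(1 << W)
            z[i] ^= r
            z[j] ^= r ^ (x[i] & y[j]) ^ (x[j] & y[i])
    return z

def sec_add(x, y):
    # masked version of: while y: x, y = x ^ y, (x & y) << 1
    for _ in range(W):
        t = sec_and(x, y)
        x = sec_xor(x, y)
        y = [(v << 1) & MASK for v in t]
    return x

for a, b in [(0, 0), (1, MASK), (12345, 54321)]:
    assert unmask(sec_add(bshare(a), bshare(b))) == (a + b) & MASK
```

Since every AND in the adder becomes an S^2-cost SecAND, the computational blow-up mentioned above is visible directly, and bit-slicing amortizes it across 32 parallel conversions on the Cortex-M4.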
For B2A conversion, we employ the method of [BCZ18]. We note that this conversion is restricted to power-of-two moduli for the arithmetic masking. However, as mentioned in Section 4, we can freely choose the (large) masking modulus to switch to. We choose a power-of-two modulus, specifically to be able to use this efficient B2A conversion method.
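To illustrate why a power-of-two modulus is convenient, Boolean and arithmetic masking interact cleanly modulo 2^k. As a hedged example (this is Goubin's classic first-order B2A conversion, shown only for intuition; it is not the higher-order [BCZ18] algorithm we actually use):

```python
from secrets import randbits

def b2a_goubin(xp, r, k):
    """First-order Boolean-to-arithmetic conversion (Goubin, CHES 2001).

    Input: Boolean shares (xp, r) with x = xp ^ r.
    Output: arithmetic share A with x = (A + r) mod 2**k.
    The sequence of operations is independent of the shared value x.
    """
    mod = 1 << k
    gamma = randbits(k)          # fresh masking randomness
    t = xp ^ gamma
    t = (t - gamma) % mod
    t ^= xp
    gamma ^= r
    a = xp ^ gamma
    a = (a - gamma) % mod
    return a ^ t

# Sanity check over random sharings of random 13-bit values.
for _ in range(100):
    xp, r = randbits(13), randbits(13)
    assert (b2a_goubin(xp, r, 13) + r) % (1 << 13) == xp ^ r
```

The subtraction modulo 2^k is what makes this work; for a prime modulus no such constant-time shortcut exists, which is why the free choice of masking modulus in Section 4 matters.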
After the right-shift on lines 14 and 17, the Boolean shares B B and C B are of limited bit-width. An ideal B2A conversion for this task is the one of [SPOG19], which converts each bit separately. However, in our experiments, the B2A conversion of [SPOG19] was outperformed by the conversion of [BCZ18] 5 .
Finally, BooleanEqualityTest requires checking that ∑_{j=0}^{S−1} E^{(j)} = 0. We first convert this to E B = 0 using an A2B conversion from the arithmetic sharing with modulus p · 2 s−1 . Then, for every bit i of E B , we must check that E B [i] = 0. This can easily be implemented as a masked circuit, e.g. an AND-reduction over the complemented bits of E B using SecAND.
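A hedged sketch of such a zero-test circuit follows (the share representation and helper names are ours, for illustration): E B = 0 holds iff every bit of the complement ¬E B equals 1, so an AND-reduction over the complemented bits, performed with a masked AND, yields a Boolean sharing of the comparison bit without ever unmasking E B .

```python
from secrets import randbits
from functools import reduce

def sec_and(x, y, w):
    # ISW-style masked AND on w-bit Boolean-shared words
    n = len(x)
    z = [x[i] & y[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = randbits(w)
            z[i] ^= r
            z[j] ^= (r ^ (x[i] & y[j])) ^ (x[j] & y[i])
    return z

def masked_zero_test(e, k):
    """Boolean sharing of the bit '(XOR of shares e) == 0', k a power of two.

    Complement one share (so the shared value becomes ~E), then AND-reduce
    the k bits in log2(k) halving steps, never unmasking E.
    """
    acc = [e[0] ^ ((1 << k) - 1)] + list(e[1:])
    width = k
    while width > 1:
        half = width // 2
        hi = [s >> half for s in acc]
        lo = [s & ((1 << half) - 1) for s in acc]
        acc = sec_and(hi, lo, half)
        width = half
    return [s & 1 for s in acc]

# E = 0 -> unmasked result 1; E != 0 -> unmasked result 0.
assert reduce(lambda u, v: u ^ v, masked_zero_test([0x5A, 0x5A, 0x00], 8)) == 1
assert reduce(lambda u, v: u ^ v, masked_zero_test([0x5A, 0x5A, 0x04], 8)) == 0
```

The tree-shaped reduction keeps the number of SecAND calls logarithmic in the bit-width, which matters because SecAND is the only expensive gate.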

Performance Evaluation
To measure the performance of our new masked comparison algorithm, we benchmarked our technique on an STM32F407-DISC1 board featuring an ARM Cortex-M4F. We compile with -O3, using arm-none-eabi-gcc version 9.2.1. We use the same settings as the popular PQM4 benchmarking framework [KRSS], i.e. a 24 MHz system clock and a 48 MHz TRNG clock. We sample all masking randomness from the on-chip TRNG, and we include the sampling cost as well as the total number of requested random bytes in the benchmarks. On the STM32F407, the TRNG can supply 4 random bytes every 40 TRNG clock cycles, which corresponds to 20 cycles of the main system clock.
In Table 2, we show the cycle counts of our implementation for the two considered schemes, Kyber and Saber, and their main parameter sets (k = 3). We further profile our implementation and break down cycle counts and random bytes in terms of the five steps of Algorithm 7. We benchmark our implementations with collision probability 2 −s = 2 −54 , corresponding to a maximal bitwidth of 64 bits for the internal variables in Steps 2-4. We provide additional implementations with s = 118 (maximal bitwidth 128 bits) and s = 128 (maximal bitwidth 138 bits) in Appendix A.
Steps 1 and 2, i.e. the A2B and B2A conversions, clearly constitute the main computational bottlenecks, taking up to 95% of the total algorithm execution time. High-performance conversions are therefore critical, which motivates choices such as the bit-slicing described in the previous section. In our employed conversions, the complexity of A2B is quadratic in the number of shares, whereas the complexity of B2A is exponential. This difference is already visible for Saber with 4 shares (3 rd -order masking), where B2A becomes the more costly routine.
Our masked comparison algorithm differs between Saber and Kyber only in the preprocessing and the A2B conversion. The cycle count difference is most noticeable for the A2B conversion. In this conversion, we have for Saber that p = 2 10 , T = 2 4 , f bits,B = 3, and f bits,C = 6. In other words, we need 13-bit A2B conversions for B and 10-bit A2B conversions for C. For Kyber, we have that p = 2 10 , T = 2 4 , and f bits,{B,C} = 13. In other words, we need 23-bit and 17-bit A2B conversions. Furthermore, to avoid a rounding error as explained in [FVBR + 21], for Kyber f bits has to increase logarithmically with the number of shares, e.g. f bits = 14 for 3 shares and f bits = 15 for 4 shares. The complexity of A2B is linear in the number of considered bits, leading to the increased runtime for Kyber in Step 1. The randomness consumption follows the same trend as the cycle counts. Due to the increased number of iterations within the A2B routine, Kyber requires additional random bytes in Step 1. For B2A, as with its cycle counts, the randomness consumption increases exponentially with the number of shares.

5 In the sequence B2A(A2B(·) ≫ f bits ) it is also possible to compute B2A conversions only for the carry, further limiting the bit-width. Even with this optimization (called A2A conversion in [VBDK + 21]), we found that [BCZ18] is the preferred B2A conversion.
We also compare our new technique to the results reported by Bos et al. [BGR + 21] for their masked implementation of Kyber. Rather than developing a masked compression method for the re-encrypted ciphertext, their DecompressedComparison routine decompresses the input ciphertext; a masked range check is then employed to determine ciphertext equality. Our method achieves cycle count improvements of 4.2x and 7.5x for 2 nd - and 3 rd -order maskings, respectively. In their higher-order masked Kyber implementations, the masked comparison accounts for 50%, resp. 63%, of the total execution time of masked decapsulation, and our speedup could therefore contribute significantly to reducing overall cycle counts.
In this work, we are concerned with a higher-order masked comparison method. In contrast to [BGR + 21], we did not employ custom A2B and B2A algorithms specialized for the 1 st -order case. As a result, for only 2 shares our algorithm is outperformed by other solutions (by a factor of 5.0x for [BGR + 21]).

Leakage Evaluation
We now present the results of our leakage evaluation. We performed side-channel measurements (power consumption) using the ChipWhisperer Lite [Newa] board. The target device is an STM32F303 board with an ARM Cortex-M4 core running at 7.37 MHz. We capture the traces with a sample rate of 29 MS/s; the sample clock is synchronized to the device clock [Newa]. The code for the side-channel evaluation was compiled using arm-none-eabi-gcc, version 9.2.1. We take measurements on the complete algorithm comparing two polynomials modulo 2 13 (Saber) with 32 coefficients, the smallest possible option due to the bit-sliced A2B implementation. For the sake of clarity, only these results are included in this section. To ensure that we do not miss any leakage due to the limited sample buffer of the ChipWhisperer Lite, we perform measurements on the smaller building blocks individually and include them in Appendix A.
We show that our method is applicable in practice and does not have any obvious weaknesses when confronted with first- and second-order side-channel attacks. We use the non-specific t-test of the Leakage Assessment Methodology presented in [SM15]. More concretely, we use the fixed + noise variant presented in [BDH + 21], which additionally detects leakage caused by possibly unmasked partial comparisons. The first-order t-test statistic is calculated as t = (m 0 − m 1 ) / √(s 0 /n 0 + s 1 /n 1 ), where m 0 denotes the sample mean, s 0 the sample variance, and n 0 the sample size for the traces with fixed + noise input. The sample mean, sample variance, and sample size for the random set are denoted by m 1 , s 1 , and n 1 , respectively. The methodology presented in [SM15] calculates a threshold t-value of 4.5 to obtain a confidence of 0.99999 when rejecting the null hypothesis that the two sets are indistinguishable. An absolute t-statistic larger than 4.5 thus indicates leakage.
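For concreteness, the statistic can be computed per sample point as follows (a plain sketch of Welch's t-test, not our evaluation scripts):

```python
from math import sqrt

def welch_t(set0, set1):
    """Welch's t-statistic between two trace sets at one sample point.

    set0: leakage samples for the fixed + noise inputs,
    set1: leakage samples for the random inputs.
    """
    n0, n1 = len(set0), len(set1)
    m0, m1 = sum(set0) / n0, sum(set1) / n1
    s0 = sum((x - m0) ** 2 for x in set0) / (n0 - 1)  # sample variance
    s1 = sum((x - m1) ** 2 for x in set1) / (n1 - 1)
    return (m0 - m1) / sqrt(s0 / n0 + s1 / n1)

# Identically distributed sets stay below the 4.5 threshold ...
assert abs(welch_t([0, 1] * 500, [0, 1] * 500)) < 4.5
# ... while a mean shift between the sets is flagged as leakage.
assert abs(welch_t([0, 1] * 500, [1, 2] * 500)) > 4.5
```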
When the Random Number Generator (RNG) is turned off, as in Figure 3, we can observe that the algorithm shows first-order leakage after only 10 000 measurements. This expected result confirms a correct setup of our measurement equipment.
We activate the RNG in our next experiment using two shares. The results are shown in Figure 4a. We cannot identify any peaks in our measurements with 100 000 executions. Additionally, we provide the first-order t-test statistics for the three-share option of our algorithm in Figure 4b, where we also cannot detect any first-order leakage.
For the bivariate second-order t-test, we first combine each trace at two points in time T = {i, j} using the so-called centered product (x i − µ i )(x j − µ j ), computed separately for each set y ∈ {0, 1}. Then a second-order t-test is performed on the resulting two-dimensional traces. As presented in [SM15], d-th order central moments CM d = E((X − µ) d ) are required: the mean of the first-order t-test is replaced with the second-order central moment CM 2 , whereas the variance is set to CM 4 − CM 2 ². Using the methodology of [SM15], we calculate the t-statistic iteratively, without separate sampling, combination, and calculation steps. However, as this step is computationally very expensive, we have to reduce the number of sample points significantly during capturing. This carries the risk of missing leakage, and thus a more efficient higher-order evaluation approach might be interesting future work.
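A minimal sketch of this preprocessing step (the centered product over all point pairs; variable names are ours):

```python
def centered_product(traces):
    """Combine each trace at all point pairs (i, j), i <= j, via the
    centered product (x_i - mean_i) * (x_j - mean_j).

    traces: equal-length traces from ONE set (fixed + noise or random);
    the point-wise means are estimated per set, as the centered
    product requires.
    """
    n, L = len(traces), len(traces[0])
    mean = [sum(t[p] for t in traces) / n for p in range(L)]
    return [[(t[i] - mean[i]) * (t[j] - mean[j])
             for i in range(L) for j in range(i, L)]
            for t in traces]

# Two 2-point traces yield 3 combined points: pairs (0,0), (0,1), (1,1).
combined = centered_product([[1, 2], [3, 4]])
assert combined == [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
```

A t-test on these combined traces then yields the bivariate second-order statistic; the quadratic growth in the number of point pairs is exactly why the evaluation becomes computationally expensive.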
In contrast to the first-order case, the effects of applying many t-tests simultaneously are non-negligible in the second-order scenario. Similar to [BPO + 20], we can apply the Šidák correction proposed in [BPG18] to obtain a valid threshold value: t th = Q t ((1 − α)^(1/L) , ν), where Q t is the quantile function of the t-distribution, L is the trace length, α is the overall significance level, and ν is the degrees of freedom. In our example of 500 2 sample points, α = 0.00001, and 100 000 traces, the threshold t-value is 6.50.
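This threshold can be reproduced with the standard normal quantile, which closely approximates the t-distribution for the large ν in our setting (a sketch using Python's statistics module; for moderate ν one would use the exact t-quantile instead):

```python
from statistics import NormalDist

def sidak_threshold(alpha, L):
    """Leakage-detection threshold after Šidák correction.

    alpha: overall significance level; L: number of (combined) sample
    points tested. For large degrees of freedom, the Student-t quantile
    is well approximated by the standard normal quantile used here.
    """
    alpha_sid = 1 - (1 - alpha) ** (1 / L)  # per-point significance
    return NormalDist().inv_cdf(1 - alpha_sid)

# Parameters from the text: 500**2 combined points, alpha = 0.00001.
t_th = sidak_threshold(1e-5, 500 ** 2)
assert 6.4 < t_th < 6.6   # consistent with the reported 6.50
```

As expected, testing more points simultaneously raises the threshold: with a single point the same α gives a threshold of only about 4.3.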
In the first-order implementation with two shares, the bivariate second-order t-test shows clear leakage points even when only 10 000 captured traces are taken into account. This is expected behavior, because a first-order implementation in general cannot withstand second-order attacks. We graphically illustrate the result in Figure 5, where each 10x10 square of time samples is colored according to its maximum absolute t-statistic.
In contrast to the first-order implementation, a second-order implementation should not show leakage in a second-order t-test. The result of the second-order t-test on our three-share version is shown in Figure 6. As expected, even with 100 000 traces, no leakage points above 6.5 appear. Thus, all practical experiments confirm our theoretical results.

Future Work
Our new masked comparison algorithm heavily draws on A2B and B2A conversion techniques. Consequently, its performance depends crucially on the performance of these two algorithms. Table-based conversions are especially appealing, but they have so far been restricted to first-order masked implementations [CT03, Deb12, VDV21]. Concurrently with our work, higher-order table-based mask conversion methods have been proposed [CGMZ21], specifically focused on lattice-based cryptography. These methods boast increased performance compared to the A2B and B2A conversions we use in this work. The authors use their new techniques to mask the CPA-secure decryption and the binomial sampling, but explicitly leave the masking of the polynomial comparison as future work. Conveniently, the masked comparison method that we described in this work is generic and can work with any A2B or B2A conversion. Integrating these new higher-order table-based conversion methods into our masked comparison is therefore a clear direction for future work.

A Supplementary Security
Our main benchmarks use s = 54, corresponding to a collision probability of 2 −54 , and requiring a 64-bit type in Steps 2-4 of our Algorithm 7. We believe this collision probability will be sufficiently low for the majority of use cases, but do provide implementation results for s = 118 and s = 128 in Tables 3 and 4. These implementations require, respectively, a custom 128-bit type and 138-bit type in Steps 2-4 of Algorithm 7. Moving up from a 64-bit type causes bit-wise operations and randomness sampling to increase by roughly a factor of two in these steps, whereas the multiplications in Step 3 incur a higher overhead. Even with a reduced collision probability requiring custom types, our algorithm outperforms [BGR + 21].