High-order Polynomial Comparison and Masking Lattice-based Encryption

The main protection against side-channel attacks consists in computing every function with multiple shares via the masking countermeasure. For IND-CCA secure lattice-based encryption schemes, the masking of the decryption algorithm requires the high-order computation of a polynomial comparison. In this paper, we describe and evaluate a number of different techniques for such high-order comparison, always with a security proof in the ISW probing model. As an application, we describe the full high-order masking of the NIST standard Kyber, with a concrete implementation on ARM Cortex M architecture, and a t-test evaluation.


Introduction
Post-quantum cryptography. The most widely used public-key cryptosystems today are based on RSA and ECC, but they are breakable in polynomial time using a quantum computer. While it is currently unknown whether building a scalable quantum computer is feasible or not, the goal of post-quantum cryptography is to design alternatives to RSA and ECC with resistance against quantum attacks. Initiated in 2016, the NIST post-quantum standardization has now entered its final round, with the selection of the Kyber algorithm [BDK+18, ABD+21] for general encryption. The security of Kyber is based on the hardness of the module learning-with-errors (M-LWE) problem, which is conjectured to remain hard even with a full-scale quantum computer.
Side-channel attacks and the masking countermeasure. Lattice-based public-key encryption schemes are as vulnerable to side-channel attacks as any other cryptosystem, see for example [PPM17, HCY20, XPRO20]. The main countermeasure against side-channel attacks is masking [CJRR99]. It consists in splitting every variable $x$ into $n$ shares with $x = x_1 + \cdots + x_n$, and processing the shares separately. Then, an adversary with a limited number of probes cannot learn more than an adversary without probes. The study of protecting circuits against high-order attacks was initiated by Ishai, Sahai and Wagner in [ISW03]. They considered an adversary who can probe at most $t$ wires in a circuit. They showed how to transform any Boolean circuit $C$ into a circuit of size $O(|C| \cdot t^2)$ secure against such an adversary, using $n = 2t + 1$ shares. This was later improved to $n = t + 1$ shares only by Barthe et al. [BBD+16], who introduced the notions of (Strong) Non-Interference (NI/SNI) to facilitate the writing of security proofs with the composition of gadgets.
The masking countermeasure was initially developed for securing block-ciphers against side-channel attacks, for example AES in [RP10]. It appears that securing lattice-based schemes against high-order attacks offers quite new and interesting challenges, thanks to the rich algorithmic diversity of post-quantum cryptography. While in principle any algorithm can be written re-encryption, the ciphertext comparison requires checking that every coefficient belongs to a certain public range modulo $q$. Such a range test is performed via two high-order comparisons, based on arithmetic to Boolean conversions.
Recently, a high-order algorithm for performing the comparison between two ciphertexts was described in [DHP+22], based on computing linear sums as in [BDH+21], but by switching to a larger modulus, so that the probability of an incorrect result with a single equality check becomes negligible. As opposed to [BDH+21], the technique works for both prime and power-of-two moduli $q$. More precisely, it converts into a larger modulus $2^{\alpha_p + \lambda}$ where $\lambda$ is the security parameter, so that the false positive probability is at most $2^{-\lambda}$, with $\alpha_p = 10$ for Kyber and Saber. The final zero-testing is performed using arithmetic to Boolean conversion. The complexity is dominated by the cost of the conversion of each coefficient into the larger modulus, which comprises an arithmetic to Boolean conversion (with complexity $O(n^2 \cdot k)$ where $k = \lceil \log_2 q \rceil$ is the modulus size), and a Boolean to arithmetic conversion to the larger modulus (with complexity $O(n^2 \cdot \lambda)$ for security parameter $\lambda$). Asymptotically, the final complexity is then $O(\ell \cdot n^2 (k + \lambda))$ for $\ell$ coefficients, which is similar to the PolyZeroTestAB approach recalled above (see Table 1 below for a comparison). The authors also described an efficient implementation of their technique, faster in practice than [BGR+21], and also a verification of the side-channel security of their implementation using concrete leakages.
Our contributions. Our first contribution is to describe two new techniques for performing the high-order ciphertext comparison for a prime modulus $q$, more efficient than with arithmetic to Boolean conversion. Our first technique is based on converting from arithmetic masking modulo $q$ to multiplicative masking, which enables us to perform a zero-test of $x$ without revealing more information about $x$. More precisely, starting from the arithmetic shares $x_i$ of $x = x_1 + \cdots + x_n \pmod q$, we first convert into a multiplicative sharing $u_1 \cdots u_n \cdot x = B \pmod q$. With invertible masks $u_i \in \mathbb{Z}_q^*$, we must have $B \neq 0$ if $x \neq 0$, and $B = 0$ if $x = 0$, which gives a zero-test of $x$. We prove that an adversary with at most $n - 1$ probes does not learn more information about $x$. For a single coefficient, the complexity is $O(n^2)$, instead of $O(n^2 \cdot k)$ with previous approaches. For zero-testing $\ell$ coefficients, we first apply the technique from [BDH+21] to reduce to the zero-testing of $\kappa \ll \ell$ coefficients, for $\kappa = \lceil \lambda / \log_2 q \rceil$. For zero-testing the remaining $\kappa$ coefficients all at once, we again use $\kappa$ linear combinations, but this time with masked coefficients, so that we can eventually zero-test each linear combination separately, using the above method. We describe the corresponding PolyZeroTestMult algorithm in Section 4.1.
Our second algorithm is based on masked exponentiation modulo a prime $q$, using Fermat's little theorem. For zero-testing a single arithmetically masked coefficient $x$, we high-order compute $b = 1 - x^{q-1} \bmod q$, which gives $b = 1$ if $x = 0$, and $b = 0$ if $x \neq 0$, as required. With a square-and-multiply, the complexity is $O(n^2 \cdot \log q)$. For zero-testing $\ell$ coefficients, as previously we first reduce to the zero-testing of $\kappa \ll \ell$ coefficients $y^{(j)}$. For zero-testing the remaining $\kappa$ coefficients $y^{(j)}$ all at once, we high-order compute the product modulo $q$ of the corresponding bits $b^{(j)} = 1 - (y^{(j)})^{q-1} \bmod q$, which gives $b = 1$ if all coefficients are zero, and $b = 0$ otherwise, as required. We describe the corresponding PolyZeroTestExpo algorithm in Section 4.2. We refer to Table 1 below for a summary. We show that in practice, both techniques are more efficient than arithmetic to Boolean conversion (see Section 3.3).
As an application, our second contribution is to improve the efficiency of the high-order polynomial comparison in Kyber for IND-CCA decryption (Step 3). Recall that in Kyber the ciphertext coefficients are compressed from modulo $q$ to $d$ bits by computing the function $\mathsf{Compress}_{q,d}(x) := \lfloor (2^d/q) \cdot x \rceil \bmod 2^d$. We first consider an alternative approach to [BGR+21], where we explicitly high-order mask the Compress function during the encryption process. For this we extend the modulus switching technique from [FBR+21], which was first-order only. To our knowledge, this is the first proposal for high-order masking the Compress function of Kyber. We also consider the high-order ciphertext comparison without the Compress function as in [BGR+21], and we provide an alternative, faster technique when the output size of Compress is close to the bitsize of $q$, which is the case for 3/4 of the ciphertext coefficients in Kyber.

Table 1: Complexities of polynomial comparison, with $\ell$ coefficients and $n$ shares, and a modulus $2^k$ or a $k$-bit prime $q$. We write $\kappa = \lceil \lambda / \log_2 q \rceil$, where $\lambda$ is the security parameter.
PolyZeroTestExpo | Exponentiation, SecMult | mod $q$ | $O(\ell\kappa n + \kappa n^2 \log q)$
Finally, we show that the best strategy for polynomial comparison in Kyber is hybrid: for the first part of the ciphertext, we do not apply the Compress function and perform the comparison over uncompressed ciphertexts (as in [BGR+21], but with our faster algorithm), while for the second part of the ciphertext, we high-order compute the Compress function and perform the comparison over Boolean shares. We also provide a detailed description of the masking of the full IND-CCA decryption of the Kyber scheme at any order, and we describe the practical results of a C implementation of the full high-order masking of Kyber and Saber. The source code is public and can be found at https://github.com/fragerar/HOTableConv/tree/main/Masked_KEMs. We have also performed a t-test evaluation using the ChipWhisperer platform for power traces.
Follow-up works. Very recently, [DBV22] described a variant of our hybrid method for the ciphertext comparison in Kyber, in which for the first part of the ciphertext, the results of the comparison over uncompressed coefficients are converted from arithmetic to Boolean masking. While the asymptotic complexity remains the same, for a bitsliced implementation, the authors obtain a 25% speed-up compared to our hybrid method. The authors of [BC22] also described a bitsliced implementation of Kyber and Saber, starting from our implementation of the hybrid method, and achieved a significant performance improvement.

Notations and security definitions
For any positive integer $q$, we define $r' = r \bmod q$ to be the unique element $r'$ in the range $[0, q[$ such that $r' = r \pmod q$. For an even (resp. odd) positive integer $q$, we define $r' = r \bmod^{\pm} q$ to be the unique element $r'$ in the range $-q/2 < r' \le q/2$ (resp. $-(q-1)/2 \le r' \le (q-1)/2$) such that $r' = r \pmod q$. For $x \in \mathbb{Q}$, we denote by $\lfloor x \rceil$ the rounding of $x$ to the nearest integer, with ties being rounded up. We denote by $x \gg k$ the shifting of an integer $x$ by $k$ positions to the right, that is $\lfloor x / 2^k \rfloor$.
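The notations above can be illustrated with a minimal Python sketch. The function names (`mod_q`, `mod_pm`, `round_half_up`, `shift_right`) are hypothetical and chosen only for this illustration; exact rationals are used so that the tie-breaking rule is unambiguous.

```python
from fractions import Fraction
import math

def mod_q(r: int, q: int) -> int:
    """r mod q: the unique representative in [0, q)."""
    return r % q

def mod_pm(r: int, q: int) -> int:
    """Centered reduction r mod± q.

    Even q: result in (-q/2, q/2].  Odd q: result in [-(q-1)/2, (q-1)/2].
    """
    r %= q
    return r - q if r > q // 2 else r

def round_half_up(x: Fraction) -> int:
    """Nearest integer to x, with ties rounded up."""
    return math.floor(x + Fraction(1, 2))

def shift_right(x: int, k: int) -> int:
    """x >> k, that is floor(x / 2^k)."""
    return x >> k
```

Note that Python's built-in `round` uses round-half-to-even, which is why the sketch relies on `floor(x + 1/2)` instead.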
We recall below the NI/SNI definitions introduced in [BBD+16]. Those definitions are quite convenient as they allow the easy composition of gadgets. One can then focus on proving the NI/SNI property for individual gadgets, and the security of the full circuit will follow by composition. The SNI definition is stronger than NI in that the number of input shares required for the simulation only depends on the number of internal probes, and not on the number of output shares that must be simulated. If a gadget only satisfies the NI definition, usually this is not a problem as we can apply some SNI mask refreshing as output and the resulting gadget becomes SNI (see [BBD+16]). In this paper all our gadgets will be proven either NI or SNI.
Definition 1 ($t$-NI security). Let $G$ be a gadget taking as input $(x_i)_{1 \le i \le n}$ and outputting the vector $(y_i)_{1 \le i \le n}$. The gadget $G$ is said to be $t$-NI secure if for any set of $t_1 \le t$ intermediate variables, there exists a subset $I$ of input indices with $|I| \le t_1$, such that the $t_1$ intermediate variables can be perfectly simulated from $x_{|I}$.
Definition 2 ($t$-SNI security). Let $G$ be a gadget taking as input $n$ shares $(x_i)_{1 \le i \le n}$, and outputting $n$ shares $(z_i)_{1 \le i \le n}$. The gadget $G$ is said to be $t$-SNI secure if for any set of $t_1$ probed intermediate variables and any subset $O$ of output indices, such that $t_1 + |O| \le t$, there exists a subset $I$ of input indices that satisfies $|I| \le t_1$, such that the $t_1$ intermediate variables and the output variables $z_{|O}$ can be perfectly simulated from $x_{|I}$.
Note that for masking the IND-CCA decryption, when performing the comparison between two ciphertexts $c$ and $c'$, the output bit $b$ of the comparison must eventually be computed in the clear, which means that the $n$ shares $b_i$ of $b$ must eventually be recombined. For this we use the extended notion of NI security from [BBE+18, Definition 7], in which the output $b$ of the gadget is given to the simulator.
Definition 3 ($t$-NIo security [BBE+18]). Let $G$ be a gadget taking as input $(x_i)_{1 \le i \le n}$ and outputting $b$. The gadget $G$ is said to be $t$-NIo secure if for any set of $t_1 \le t$ intermediate variables, there exists a subset $I$ of input indices with $|I| \le t_1$, such that the $t_1$ intermediate variables can be perfectly simulated from $x_{|I}$ and $b$.
To satisfy this definition, one can use the same approach as in [Cor14] for recombining the output shares of a block-cipher to output the ciphertext: one performs a sequence of $n$ mask refreshings, each of complexity $O(n)$, so that the share recombination can be perfectly simulated, knowing the output bit $b$. Namely, this output bit $b$ of the comparison can be given for free to the simulator, since that bit $b$ is eventually known by the adversary. Following [Cor14], this enables us to prove the $(n-1)$-NIo property of the share recombination algorithm; see Appendix B.1 for more details.
In Appendix B.2, we describe a slightly more efficient approach, still with complexity $O(n^2)$, but using only half the randomness.

High-order zero testing
In the IND-CCA decryption of lattice-based schemes such as Kyber, according to the Fujisaki-Okamoto transform, we must perform a comparison between the input ciphertext $\tilde{c}$, and the re-encrypted ciphertext $c$. In the context of the masking countermeasure, the re-encrypted ciphertext $c$ is masked with $n$ shares, so we must perform this comparison over arithmetic or Boolean shares. Moreover, the coefficients of the polynomials $\tilde{c}$ and $c$ must be compared all at once. Otherwise, partial comparison results can leak information about the secret key, as demonstrated in [BDH+21].
In this section, for simplicity, we consider the zero-testing of a single coefficient. We will then show in Section 4 how to test multiple coefficients at once. With arithmetic shares, comparing two individual coefficients $x$ and $y$ in $\mathbb{Z}_q$ is equivalent to zero testing $x - y \in \mathbb{Z}_q$. Similarly, with Boolean shares, comparing two coefficients $x, y \in \{0, 1\}^k$ is equivalent to zero testing $x \oplus y$. Therefore, in the rest of this section, we focus on zero-testing.
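The reduction from comparison to zero-testing can be sketched as follows, assuming that the value $y$ being compared against is public (as is the case for the input ciphertext). The helper names `share`, `unmask`, `diff_shares` and `xor_diff_shares` are hypothetical, and `unmask` is used only to check the sketch: a masked implementation would never recombine the shares this way.

```python
import secrets

Q = 3329  # Kyber modulus, used here only as an example

def share(x, n, q=Q):
    """Split x into n arithmetic shares modulo q."""
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % q)
    return shares

def unmask(shares, q=Q):
    """Recombine arithmetic shares (for checking the sketch only)."""
    return sum(shares) % q

def diff_shares(x_shares, y_public, q=Q):
    """Arithmetic shares of x - y: subtract the public y from one share."""
    out = list(x_shares)
    out[0] = (out[0] - y_public) % q
    return out

def xor_diff_shares(x_shares, y_public):
    """Boolean shares of x XOR y: XOR the public y into one share."""
    out = list(x_shares)
    out[0] ^= y_public
    return out
```

In both cases the resulting sharing encodes 0 exactly when $x = y$, so the comparison reduces to a zero-test on the new shares.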
For a single coefficient $x$, we are therefore given as input the $n$ Boolean shares of $x = x_1 \oplus \cdots \oplus x_n \in \{0, 1\}^k$, or the $n$ arithmetic shares of $x = x_1 + \cdots + x_n \bmod q$, and we must output a bit $b$, with $b = 1$ if $x = 0$ and $b = 0$ if $x \neq 0$, without revealing more information about $x$. This means that an adversary with at most $t = n - 1$ probes will learn nothing about $x$, except whether $x = 0$ or not. For the security proof, the simulation technique is the same as for security proofs in the ISW probing model, except that the output bit $b$ is additionally given to the simulator (see Section 2).
From Boolean shares over $\{0, 1\}^k$, one can perform a zero-test with complexity $O(n^2 \cdot \log k)$; we recall the technique in Appendix C.3 (ZeroTestBoolLog algorithm). From arithmetic shares modulo $q$, the simplest technique is to first perform an arithmetic to Boolean conversion, and then apply the zero-testing on the Boolean shares. The complexity is $O(n^2 k)$ for a $k$-bit modulus. We recall the technique in Appendix C.5 (ZeroTestAB algorithm). In Appendix C.6, we also describe an alternative zero-testing for arithmetic masking based on the generic table recomputation approach from [CGMZ22], with the register optimization. In that case the countermeasure has complexity $O(n^2)$ only, assuming that we have access to $2^k$-bit registers. Therefore this optimization can only work for small $k$, say up to $k = 8$.
New zero-testing gadgets. For arithmetic shares modulo a prime $q$, we describe two new zero-testing algorithms, more efficient than the state of the art. The first technique (ZeroTestMult in Section 3.1) is based on converting from arithmetic masking to multiplicative masking, so that one can distinguish between $x = 0$ and $x \neq 0$, without revealing more information about $x$. The second technique (ZeroTestExpo in Section 3.2) is based on Fermat's theorem and consists in high-order computing $b = 1 - x^{q-1} \bmod q$, which gives $b = 1$ if $x = 0$, and $b = 0$ if $x \neq 0$, as required. We will see in Section 4 that for zero-testing $\ell$ coefficients at once, these two techniques are more efficient than arithmetic to Boolean conversion. We refer to Table 2 for a summary.

Table 2: Complexities of zero testing a single value with $n$ arithmetic shares, and a modulus $2^k$ or a $k$-bit prime $q$. (Columns: Technique, Masking, Complexity.)
For the ciphertext comparison in Kyber, we will describe in Section 5 a hybrid approach in which the first part of the re-encrypted ciphertext is arithmetically masked modulo q, while the remaining part is Boolean masked. Therefore, we will use the ZeroTestBoolLog algorithm for the second part, and for the first part either ZeroTestMult or ZeroTestExpo, which offer similar performances on Kyber. For Saber, the re-encrypted ciphertext is completely Boolean shared, so we will use ZeroTestBoolLog. Finally, the ZeroTestAB algorithm will not be used in our constructions, but we keep this algorithm anyway for comparison.

Zero testing modulo a prime q via multiplicative masking
Our technique works for prime $q$ only. It is based on converting from arithmetic masking modulo $q$ to multiplicative masking. When the secret value $x$ is 0, the multiplicatively masked value remains 0, whereas for $x \neq 0$, we obtain a random non-zero masked value. This enables us to distinguish the two cases, without leaking more information about $x$.
More precisely, given as input the shares $x_i$ of $x = x_1 + \cdots + x_n \pmod q$, we convert the arithmetic masking into a multiplicative masking. For this we generate a random $u_1 \in \mathbb{Z}_q^*$ and we compute:
$$u_1 \cdot x = u_1 \cdot x_1 + \cdots + u_1 \cdot x_n \pmod q$$
by computing the corresponding shares $x'_i = u_1 \cdot x_i \bmod q$ for all $1 \le i \le n$. We then perform a linear mask refreshing of the arithmetic shares $x'_i$. Such linear mask refreshing is not SNI but it is NI. Moreover, it has the property that any subset of $n - 1$ output shares is uniformly and independently distributed, as in the mask refreshing from [RP10].
We proceed similarly with the multiplicative shares $u_2, \ldots, u_n \in \mathbb{Z}_q^*$. Eventually we obtain an arithmetic sharing $(B_i)_{1 \le i \le n}$ satisfying:
$$u_1 \cdots u_n \cdot x = B_1 + \cdots + B_n \pmod q$$
Thanks to the $n$ multiplicative shares $u_i$, we can now safely decode the arithmetic sharing $(B_i)_{1 \le i \le n}$ without revealing more information about $x$. More precisely, we compute $B = B_1 + \cdots + B_n \pmod q$, and we obtain:
$$B = u_1 \cdots u_n \cdot x \pmod q$$
Recall that $u_i \in \mathbb{Z}_q^*$ for all $1 \le i \le n$. Therefore if $x \neq 0$, we must have $B \neq 0$, and if $x = 0$, we have $B = 0$. This gives a zero-test of $x$. We provide below a pseudocode description of the corresponding ZeroTestMult algorithm. We recall the LinearRefreshMasks algorithm in Appendix B.1.
Note that we obtain a bit b directly in the clear. This means that when zero testing multiple coefficients at once, we cannot keep an n-shared bit b and high-order combine the results of individual zero-testing. Therefore, to test multiple coefficients at once, we will have to proceed differently (see Section 4).
Complexity. For simplicity we ignore the reductions modulo $q$ in the operation count. The complexity of LinearRefreshMasks is $3(n - 1)$ operations. We obtain:
$$T_{\mathsf{ZeroTestMult}}(n) = n \cdot (1 + n + 3(n - 1)) + n = n \cdot (4n - 1) \simeq 4n^2$$
The technique has therefore complexity $O(n^2)$ for a single coefficient. That is, as opposed to the ZeroTestAB algorithm, the complexity is independent of the size of the modulus $q$, assuming that arithmetic operations in $\mathbb{Z}_q$ take unit time. We will see in Section 3.3 that for zero testing a single coefficient, the technique is much faster.

Algorithm 1 ZeroTestMult
Input: $x_1, \ldots, x_n \in \mathbb{Z}_q$ for prime $q$.
Output: $b \in \{0, 1\}$ with $b = 1$ if $\sum_i x_i = 0 \pmod q$ and $b = 0$ otherwise.
    $(B_1, \ldots, B_n) \leftarrow (x_1, \ldots, x_n)$
    for $j = 1$ to $n$ do
        $u_j \leftarrow \mathbb{Z}_q^*$
        $(B_1, \ldots, B_n) \leftarrow (u_j \cdot B_1 \bmod q, \ldots, u_j \cdot B_n \bmod q)$
        $(B_1, \ldots, B_n) \leftarrow \mathsf{LinearRefreshMasks}(B_1, \ldots, B_n)$
    end for
    $B \leftarrow B_1 + \cdots + B_n \bmod q$
    if $B = 0$ then return 1
    else return 0

Security. The following theorem shows that the adversary does not get more information than whether $x = 0$ or not. The argument is as follows: if the adversary has at most $n - 1$ probes, then at least one multiplication by $u_i \in \mathbb{Z}_q^*$ and subsequent mask refreshing has not been probed. In that case, all output shares of the corresponding mask refreshing can be perfectly simulated, knowing the output bit $b$. Namely if $x \neq 0$, the output shares must encode a random element in $\mathbb{Z}_q^*$ (thanks to the multiplication by the random $u_i \in \mathbb{Z}_q^*$ which has not been probed), and if $x = 0$, the output shares are an encoding of 0. In both cases, since by assumption the mask refreshing has not been probed, we can provide a perfect simulation of all output shares of the mask refreshing, which is easily propagated to the end of the algorithm, and eventually to the recombination of the shares and the bit $b$. We provide the proof in Appendix C.9.
Theorem 1 ($(n-1)$-NIo of ZeroTestMult). The ZeroTestMult algorithm takes as input $n$ arithmetic shares $x_i$ for $1 \le i \le n$ and outputs a bit $b$ with $b = 1$ if $\sum_{i=1}^{n} x_i = 0 \pmod q$ and $b = 0$ otherwise. Any $t$ probes can be perfectly simulated from $x_{|I}$ and $b$, with $|I| \le t$.
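The ZeroTestMult procedure described in this section can be sketched functionally in Python as follows. This is only an illustration of the data flow, not a side-channel protected implementation: the function names are hypothetical, and the linear refresh shown (a fresh random added to the first share and subtracted from share $i$) is an assumed instance of LinearRefreshMasks, whose actual description is in Appendix B.1.

```python
import secrets

Q = 3329  # Kyber modulus (prime), used as an example

def linear_refresh_masks(shares, q=Q):
    """Re-randomise an arithmetic sharing without changing its value."""
    out = list(shares)
    for i in range(1, len(out)):
        r = secrets.randbelow(q)
        out[0] = (out[0] + r) % q
        out[i] = (out[i] - r) % q
    return out

def zero_test_mult(shares, q=Q):
    """Return 1 if the shared value is 0 mod q, and 0 otherwise."""
    b = list(shares)
    for _ in range(len(b)):
        u = 1 + secrets.randbelow(q - 1)      # random invertible mask in Z_q^*
        b = [(u * b_i) % q for b_i in b]      # multiply every share by u
        b = linear_refresh_masks(b, q)        # refresh between multiplications
    total = 0
    for b_i in b:                             # recombine B = u_1...u_n * x mod q
        total = (total + b_i) % q
    return 1 if total == 0 else 0
```

For example, `zero_test_mult([1, 2, Q - 3])` tests a 3-share encoding of 0 and returns 1, while any encoding of a non-zero value returns 0 because the product of invertible masks cannot map it to 0.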

Zero testing modulo a prime q via exponentiation
Our second technique also works for prime $q$ only. It consists in computing
$$b = 1 - x^{q-1} \bmod q \qquad (1)$$
By Fermat's little theorem, we obtain $b = 1$ if $x = 0 \pmod q$ and $b = 0$ otherwise, as required. Given as input the shares $x_i$ of $x = x_1 + \cdots + x_n \pmod q$, the exponentiation $x^{q-1} \bmod q$ in (1) can be computed with a square-and-multiply, using a sequence of high-order multiplications modulo $q$. Eventually we obtain an arithmetic sharing of $b = b_1 + \cdots + b_n \pmod q$, and we recombine the shares to get the bit $b$. The complexity of each high-order multiplication modulo $q$ is $O(n^2)$ for $n$ shares. Hence the complexity is $O(n^2 \cdot \log q)$, assuming that arithmetic operations modulo $q$ take unit time.
We recall in Appendix C.7 the secure multiplication algorithm SecMult, already considered in [SPOG19]. We then provide in Appendix C.8 the pseudo-code of the ZeroTestExpo algorithm computing the bit $b$ as in (1). We provide the proof of the following theorem in Appendix C.10.
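The exponentiation-based zero-test can be sketched functionally as below. The multiplication gadget is a standard ISW-style multiplication adapted to arithmetic shares modulo $q$, which is an assumption about what SecMult looks like (the actual algorithms are in Appendices C.7 and C.8). The function names and the refresh applied before each squaring are choices made for this illustration only, and the sketch makes no claim of probing security.

```python
import secrets

Q = 3329  # Kyber modulus (prime), used as an example

def refresh(shares, q=Q):
    """Simple linear refresh of an arithmetic sharing."""
    out = list(shares)
    for i in range(1, len(out)):
        r = secrets.randbelow(q)
        out[0] = (out[0] + r) % q
        out[i] = (out[i] - r) % q
    return out

def sec_mult(a, b, q=Q):
    """ISW-style multiplication: shares of (sum a)*(sum b) mod q."""
    n = len(a)
    c = [(a[i] * b[i]) % q for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbelow(q)
            c[i] = (c[i] + r) % q
            c[j] = (c[j] + (a[i] * b[j] - r) + a[j] * b[i]) % q
    return c

def sec_pow(x, e, q=Q):
    """Square-and-multiply on shares: shares of x^e mod q."""
    acc = [1] + [0] * (len(x) - 1)        # trivial sharing of 1
    for bit in bin(e)[2:]:
        acc = sec_mult(refresh(acc, q), acc, q)   # squaring
        if bit == "1":
            acc = sec_mult(acc, x, q)
    return acc

def zero_test_expo(x, q=Q):
    """Return b = 1 - x^(q-1) mod q, recombined only at the very end."""
    y = sec_pow(x, q - 1, q)
    b = [(-y_i) % q for y_i in y]
    b[0] = (b[0] + 1) % q                 # shares of 1 - x^(q-1)
    return sum(b) % q
```

The exponent $q - 1 = 3328$ has 12 bits, so the sketch performs at most 24 calls to `sec_mult`, in line with the $O(n^2 \cdot \log q)$ estimate.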

Comparison of zero-test algorithms
We provide below a comparison of the 3 zero-test algorithms that work modulo q, with q = 3329 as in Kyber. We see in Table 3 that for testing a single value, ZeroTestMult is more than one order of magnitude faster than ZeroTestAB and ZeroTestExpo.

High-order polynomial comparison
In this section we extend the zero-testing techniques from Section 3 to multiple coefficients all at once. We refer to Table 1 in Section 1 for a summary of the resulting algorithms and their complexity.
To zero-test a set of Boolean masked coefficients (PolyZeroTestBool), we simply perform a sequence of high-order Ands of the complement in $\{0, 1\}^k$, followed by a final zero-testing to get a single bit. The approach is the same for zero testing multiple coefficients arithmetically masked modulo $2^k$ or a prime $q$, thanks to arithmetic to Boolean conversion (PolyZeroTestAB). We describe the two algorithms in Appendices D.1 and D.2.
When working modulo a prime $q$, it is very advantageous to first apply the technique from [BDH+21] that reduces the zero-testing of $\ell$ coefficients to the zero-testing of $\kappa \ll \ell$ coefficients. If all coefficients are 0, each linear combination will be 0. If at least one of the coefficients is non-zero, the linear combination will be non-zero, except with probability $1/q$. Therefore, by using $\kappa$ linear combinations, we can decrease the error probability to $q^{-\kappa}$. To get error probability lower than $2^{-\lambda}$, we can therefore take $\kappa = \lceil \lambda / \log_2 q \rceil$ for security parameter $\lambda$. We recall the technique in Appendix D.3. The main advantage is that coefficients of the linear combinations can be computed in the clear, which implies that the complexity of this first step is only $O(n)$.
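The reduction step can be sketched as follows: public random coefficients are applied share by share, so no masked multiplication is needed. The names `kappa_for` and `reduce_coeffs` and the data layout (`x_shares[j][i]` is share $i$ of coefficient $j$) are assumptions of this sketch; the actual algorithm is the one recalled in Appendix D.3.

```python
import math
import secrets

Q = 3329  # Kyber modulus (prime), used as an example

def kappa_for(lam, q=Q):
    """Number of linear combinations so that q^(-kappa) <= 2^(-lam)."""
    return math.ceil(lam / math.log2(q))

def reduce_coeffs(x_shares, kappa, q=Q):
    """Reduce ell masked coefficients to kappa masked linear combinations.

    Each output is sum_j a_j * x^(j) mod q for fresh public random a_j,
    computed independently on every share index.
    """
    ell = len(x_shares)
    n = len(x_shares[0])
    ys = []
    for _ in range(kappa):
        a = [secrets.randbelow(q) for _ in range(ell)]   # public coefficients
        y = [sum(a[j] * x_shares[j][i] for j in range(ell)) % q
             for i in range(n)]
        ys.append(y)
    return ys
```

With $q = 3329$ and $\lambda = 128$, `kappa_for(128)` gives 11, the value used later for the operation counts.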
The remaining κ coefficients must then be zero tested all at once. For this one can use the zero-testing based on multiplicative masking (ZeroTestMult), or the zero-testing based on exponentiation (ZeroTestExpo). When using the ZeroTestExpo algorithm, we keep the resulting bit of each individual zero-test in shared form, so that we can combine them by high-order multiplication using the SecMult algorithm. Eventually we recombine the shares to get the result of the global zero-test.
However, when the zero-testing is based on multiplicative masking (ZeroTestMult), we obtain the bit $b$ of an individual zero-testing in the clear, so we must proceed differently. Before applying the zero-testing, we first compute random linear combinations as in [BDH+21], but this time the coefficients of the linear combination must be masked with $n$ shares. For each linear combination, we perform a zero-test of the result. As previously, by repeating the procedure $\kappa$ times, we can decrease the error probability to $2^{-\lambda}$ with $\kappa = \lceil \lambda / \log_2 q \rceil$.
These last two methods (PolyZeroTestExpo and PolyZeroTestMult) both work modulo a prime $q$ only, so it is interesting to compare their complexities (see Table 1). We see that the exponentiation method is faster for small $q$, when $\log_2 q \ll \sqrt{\lambda}$. Otherwise, the multiplicative masking method is faster. For the Kyber scheme, with $q = 3329$ and targeting $\lambda = 128$ bits of security, we expect the two methods to have a similar level of efficiency, and in practice their running times are surprisingly close.
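As a rough illustration of this comparison, the sketch below evaluates only the leading terms of the second step, $\kappa n^2 \log_2 q$ for the exponentiation method and $\kappa^2 n^2$ for the multiplicative method, ignoring all constants. The function name is hypothetical and the output is not an operation count.

```python
import math

def leading_terms(lam, q, n):
    """Leading terms of the second step for both methods (constants ignored)."""
    kappa = math.ceil(lam / math.log2(q))
    expo = kappa * n**2 * math.log2(q)    # PolyZeroTestExpo: kappa * n^2 * log q
    mult = kappa**2 * n**2                # PolyZeroTestMult: kappa^2 * n^2
    return kappa, expo, mult

kappa, expo, mult = leading_terms(128, 3329, n=4)
ratio = expo / mult                       # equals log2(q) / kappa
```

For the Kyber parameters the ratio equals $\log_2 q / \kappa$, which is close to 1, consistent with the two methods having a similar cost.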

Polynomial comparison modulo q via multiplicative masking
As explained previously, when zero testing a value $x$ modulo $q$ using the multiplicative masking technique (Section 3.1), we obtain the resulting bit $b$ in the clear, so we cannot zero-test the coefficients iteratively. Instead, we first compute a random linear combination of the individual coefficients modulo $q$, and we then perform a zero-test of the result. This approach is similar to [BDH+21], except that we must compute the coefficients $a^{(j)}$ in the linear combination in $n$-shared form, as otherwise this can leak information on the input coefficients and then cause a CCA attack.
As previously, we consider as input an arithmetic masking of $\ell$ coefficients $x^{(j)}$, that is $x^{(j)} = x^{(j)}_1 + \cdots + x^{(j)}_n \pmod q$ for all $1 \le j \le \ell$. We first apply the reduction algorithm from $\ell$ to $\kappa$ coefficients (see Appendix D.3). In the second step, we must therefore zero-test the set of coefficients $y^{(j)}$ with arithmetic shares $y^{(j)}_i$, for $1 \le j \le \kappa$. For this, we generate random coefficients $a^{(j)} \in \mathbb{Z}_q$, and we high-order compute the linear combination:
$$z = \sum_{j=1}^{\kappa} a^{(j)} \cdot y^{(j)} \bmod q \qquad (2)$$
If $y^{(j)} = 0$ for all $1 \le j \le \kappa$, then $z = 0$. If $y^{(j)} \neq 0$ for some $1 \le j \le \kappa$, then we have $z \neq 0$, except with probability $1/q$. We can therefore perform a zero-test of $z$. The procedure can be repeated a small number of times to have a negligible probability of error. Namely, for $\kappa$ repetitions with randomly generated $a^{(j)}$, the error probability becomes $q^{-\kappa}$. Equation (2) is high-order computed using the arithmetic shares $y^{(j)}_i$ of the coefficients $y^{(j)}$. Similarly the random coefficients $a^{(j)}$ are generated via $n$ random shares $a^{(j)}_i$ in $\mathbb{Z}_q$. This is the main difference with the linear combination used in Appendix D.3 for the reduction step, in which the coefficients are computed in the clear. We stress that this time, the coefficients $a^{(j)}$ must be computed in $n$-shared form, and the multiplication $a^{(j)} \cdot y^{(j)}$ computed with SecMult. Namely, the zero-testing is applied on each linear sum, so without masking the $a^{(j)}$'s, an equation over the $y^{(j)}$'s could be leaked with fewer than $n$ probes.
From the high-order computation of (2), we obtain the $n$ shares $z_i$ of a linear combination $z$. We then apply the zero-test procedure from Section 3.1 on the shares $z_i$, which outputs a bit $b$ such that $b = 1$ if $z = 0$ and $b = 0$ otherwise. The procedure is repeated $\kappa$ times, and if we always obtain $b = 1$ from the zero-test, we output 1, otherwise we output 0. We provide in Appendix D.4 a pseudocode description of the corresponding algorithm PolyZeroTestMult.
For security level $\lambda$, the error probability must satisfy $q^{-\kappa} \le 2^{-\lambda}$, so we can take $\kappa = \lceil \lambda / \log_2 q \rceil$ repetitions. In the second step, taking as input $\kappa$ coefficients, each linear sum computation and final zero-test has complexity $O(\kappa n^2)$. Therefore, the complexity of the second step is $O(\kappa^2 n^2)$. The total complexity is therefore $O(\kappa \ell n + \kappa^2 n^2)$. We refer to Appendix D.4 for a more precise operation count. Theorems 3 and 4 below prove the soundness and security of the algorithm respectively; we refer to Appendices D.5 and D.6 for the proofs.
Theorem 3 (Soundness). The PolyZeroTestMult algorithm outputs the correct answer, except with probability at most $q^{-\kappa}$.
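The second step of this procedure (masked linear combinations followed by a multiplicative zero-test) can be sketched functionally as below, assuming the reduction to $\kappa$ coefficients has already been applied. The ISW-style `sec_mult`, the linear refresh and all function names are assumptions of this sketch; the actual PolyZeroTestMult algorithm is the one in Appendix D.4, and no probing security is claimed here.

```python
import secrets

Q = 3329  # Kyber modulus (prime), used as an example

def linear_refresh_masks(shares, q=Q):
    out = list(shares)
    for i in range(1, len(out)):
        r = secrets.randbelow(q)
        out[0] = (out[0] + r) % q
        out[i] = (out[i] - r) % q
    return out

def sec_mult(a, b, q=Q):
    """ISW-style multiplication on arithmetic shares modulo q."""
    n = len(a)
    c = [(a[i] * b[i]) % q for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbelow(q)
            c[i] = (c[i] + r) % q
            c[j] = (c[j] + (a[i] * b[j] - r) + a[j] * b[i]) % q
    return c

def zero_test_mult(shares, q=Q):
    b = list(shares)
    for _ in range(len(b)):
        u = 1 + secrets.randbelow(q - 1)          # invertible mask
        b = linear_refresh_masks([(u * b_i) % q for b_i in b], q)
    return 1 if sum(b) % q == 0 else 0

def poly_zero_test_mult(y_shares, kappa, q=Q):
    """Return 1 if all masked coefficients y^(j) are 0 (error prob. q^-kappa)."""
    n = len(y_shares[0])
    result = 1
    for _ in range(kappa):
        z = [0] * n
        for y in y_shares:
            a = [secrets.randbelow(q) for _ in range(n)]   # n-shared random a^(j)
            prod = sec_mult(a, y, q)                       # shares of a^(j) * y^(j)
            z = [(z_i + p_i) % q for z_i, p_i in zip(z, prod)]
        result &= zero_test_mult(z, q)                     # zero-test of z
    return result
```

The sketch always runs all $\kappa$ repetitions and combines the public bits with a logical AND, which matches the rule "output 1 only if every zero-test outputs 1".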

Polynomial comparison modulo q via exponentiation
In this section, we extend the technique from Section 3.2 to zero test multiple coefficients at once with the exponentiation method. As previously, we are given as input $\ell$ coefficients $x^{(j)} \in \mathbb{Z}_q$ with arithmetic shares $x^{(j)}_i$ modulo a prime $q$, and we must output a bit $b = 1$ if $x^{(j)} = 0$ for all $1 \le j \le \ell$, and $b = 0$ otherwise. We first apply the reduction algorithm, so there remain only $\kappa$ coefficients $y^{(j)}$ to be zero-tested, from their arithmetic shares $y^{(j)}_i$. To perform a zero test of all coefficients $y^{(j)}$ at once, we high-order compute:
$$b = \prod_{j=1}^{\kappa} \left(1 - (y^{(j)})^{q-1}\right) \bmod q \qquad (3)$$
and we obtain $b = 1$ if $y^{(j)} = 0$ for all $1 \le j \le \kappa$, and $b = 0$ otherwise, as required. As previously, Equation (3) can be securely computed by a sequence of high-order multiplications SecMult, using as input the arithmetic shares $y^{(j)}_i$ modulo $q$. The shares of $b$ are only recombined at the end, so that an adversary with at most $t = n - 1$ probes does not learn more than the bit $b$. Since the complexity of a single SecMult is $O(n^2)$, the complexity for $\kappa$ coefficients is $O(\kappa n^2 \log q)$, and the total complexity is therefore $O(\ell \kappa n + \kappa n^2 \log q)$.
We provide in Appendix D.7 a pseudocode description of the corresponding PolyZeroTestExpo algorithm. The proof of the following theorem is straightforward and is therefore omitted.
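The product of masked bits in Equation (3) can be sketched functionally as follows, again assuming that the reduction to $\kappa$ coefficients has already been applied. The ISW-style `sec_mult`, the refresh before squaring and the function names are assumptions of this sketch (the actual PolyZeroTestExpo pseudocode is in Appendix D.7), and no probing security is claimed.

```python
import secrets

Q = 3329  # Kyber modulus (prime), used as an example

def refresh(shares, q=Q):
    out = list(shares)
    for i in range(1, len(out)):
        r = secrets.randbelow(q)
        out[0] = (out[0] + r) % q
        out[i] = (out[i] - r) % q
    return out

def sec_mult(a, b, q=Q):
    """ISW-style multiplication on arithmetic shares modulo q."""
    n = len(a)
    c = [(a[i] * b[i]) % q for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbelow(q)
            c[i] = (c[i] + r) % q
            c[j] = (c[j] + (a[i] * b[j] - r) + a[j] * b[i]) % q
    return c

def sec_pow(x, e, q=Q):
    """Square-and-multiply on shares."""
    acc = [1] + [0] * (len(x) - 1)
    for bit in bin(e)[2:]:
        acc = sec_mult(refresh(acc, q), acc, q)
        if bit == "1":
            acc = sec_mult(acc, x, q)
    return acc

def poly_zero_test_expo(y_shares, q=Q):
    """Return b = prod_j (1 - (y^(j))^(q-1)) mod q, recombined at the end."""
    n = len(y_shares[0])
    b = [1] + [0] * (n - 1)                    # trivial sharing of 1
    for y in y_shares:
        p = sec_pow(y, q - 1, q)
        bit = [(-p_i) % q for p_i in p]
        bit[0] = (bit[0] + 1) % q              # shares of 1 - y^(q-1)
        b = sec_mult(b, bit, q)                # keep the running product masked
    return sum(b) % q
```

Unlike the multiplicative method, each individual bit stays in shared form and only the final product is recombined.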

Comparison of polynomial zero-testing
We compare in Table 4 below the operation count between three polynomial comparison techniques. For PolyZeroTestAB we work modulo $2^k$ with $k = 13$, while for PolyZeroTestExpo and PolyZeroTestMult we work modulo $q = 3329$. We see that both PolyZeroTestExpo and PolyZeroTestMult are much faster than PolyZeroTestAB. This is because for a large number of coefficients $\ell$, the asymptotic complexity of PolyZeroTestExpo and PolyZeroTestMult is $O(\ell \cdot n)$, instead of $O(\ell \cdot n^2)$ for PolyZeroTestAB. Namely, the reduction to zero testing $\kappa \ll \ell$ coefficients with complexity $O(n)$ only works for a prime modulus $q$. We have also performed a C implementation that confirms these results, see Table 5 below.

Table 4: Operation count for polynomial zero testing with arithmetic masking modulo $q$, with $n = t + 1$ shares, $\ell = 768$ coefficients, in thousands of operations, with $q = 2^{13}$ for PolyZeroTestAB and $q = 3329$ for PolyZeroTestExpo and PolyZeroTestMult. We use $\kappa = 11$, in order to reach 128 bits of security.

Table 5: Running time in thousands of cycles for a C implementation on Intel(R) Core(TM) i7-1065G7, for the same parameters as in Table 4.

Table 6: Number of calls to the rand() function (outputting a 32-bit value), for the same parameters as in Table 4.

Polynomial comparison for Kyber
In this section, we focus on the polynomial comparison in Kyber [BDK+18]. We will recall in Section 6 the full Kyber algorithm, and then describe a complete high-order masking of Kyber.
Recall that computations in Kyber are performed in $R_q = \mathbb{Z}_q[X]/(X^N + 1)$ with $N = 256$ and $q = 3329$. To reduce the ciphertext size, the coefficients of the ciphertext are compressed from modulo $q$ to $d$ bits using the function:
$$\mathsf{Compress}_{q,d}(x) := \lfloor (2^d/q) \cdot x \rceil \bmod 2^d$$
and are decompressed using the function $\mathsf{Decompress}_{q,d}(c) := \lfloor (q/2^d) \cdot c \rceil$, with $d = d_u = 10$ for the first part of the ciphertext, and $d = d_v = 4$ for the second part, according to the Kyber768 parameters (see Table 7).

In the IND-CCA decryption algorithm based on the Fujisaki-Okamoto transform [FO99], we must perform a polynomial comparison between two compressed ciphertexts: the input ciphertext $\tilde{c}$, and the re-encrypted ciphertext $c$. The $\mathsf{Compress}_{q,d}$ function is applied coefficient-wise, so for simplicity we first consider a single coefficient. Let $x$ be the re-encrypted coefficient modulo $q$ before compression, and let $c$ be the resulting compressed coefficient, that is $c = \mathsf{Compress}_{q,d}(x)$. We must therefore perform the comparison with the input ciphertext $\tilde{c}$ modulo $2^d$:
$$\tilde{c} \stackrel{?}{=} \mathsf{Compress}_{q,d}(x) \pmod{2^d} \qquad (4)$$
There are two possible approaches to perform this comparison. The first approach consists in performing the comparison as in (4). Since the re-encrypted coefficient $x$ is arithmetically masked modulo $q$, we show how to high-order compute $\mathsf{Compress}_{q,d}(x)$ with arithmetically masked input modulo $q$, and Boolean masked output in $\{0, 1\}^d$. We can then perform the high-order polynomial comparison over Boolean shares, using the PolyZeroTestBool algorithm (see Appendix D.1). We describe in Section 5.1 such high-order computation of the Compress function.
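The unmasked Compress and Decompress functions can be sketched with integer arithmetic only, using the fact that rounding with ties rounded up satisfies $\lfloor a/b \rceil = \lfloor (2a + b)/(2b) \rfloor$ for positive $b$. The function names below are chosen for this illustration.

```python
Q = 3329  # Kyber modulus

def compress(x, d, q=Q):
    """Compress_{q,d}(x) = round((2^d / q) * x) mod 2^d, ties rounded up."""
    return ((2 * x * (1 << d) + q) // (2 * q)) % (1 << d)

def decompress(c, d, q=Q):
    """Decompress_{q,d}(c) = round((q / 2^d) * c), ties rounded up."""
    return (2 * c * q + (1 << d)) // (1 << (d + 1))
```

For the Kyber768 values $d_u = 10$ and $d_v = 4$, compressing a decompressed value returns the original $d$-bit value, which is the property the ciphertext comparison relies on.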
A second approach is to avoid the computation of the Compress function, as in [BGR + 21]. Namely, instead of performing the comparison over {0, 1}^d as in (4), one can equivalently compute the set of possible candidates x̃_i such that c̃ = Compress_{q,d}(x̃_i). One must then determine whether the re-encrypted coefficient x is equal to one of the (public) candidates x̃_i, using the n arithmetic shares of x modulo q. Our contribution compared to [BGR + 21] is to describe an alternative, faster technique when the number of candidates x̃_i is small, which is the case for d = d_u = 10 (see Section 5.2).
Finally, we argue that the best approach is hybrid: for the first ℓ_1 = 768 coefficients of the ciphertext with d_u = 10, we do not compute the Compress function and apply our faster technique for the small number of candidates x̃_i, and for the remaining ℓ_2 = 256 coefficients with d_v = 4, we high-order compute the Compress function. We describe this hybrid approach in Section 5.3.
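As a concrete reference point for the compression step discussed above, the following is a minimal unmasked sketch in Python of the two rounding maps. The function names `compress` and `decompress` and the integer-only rounding trick are illustrative choices, not taken from the paper's C implementation.

```python
Q = 3329  # Kyber modulus

def compress(x: int, d: int, q: int = Q) -> int:
    """Round (2^d / q) * x to the nearest integer, then reduce modulo 2^d."""
    # floor((2^(d+1) * x + q) / (2q)) is rounding to nearest; no ties occur since q is odd.
    return ((x * (1 << (d + 1)) + q) // (2 * q)) % (1 << d)

def decompress(c: int, d: int, q: int = Q) -> int:
    """Round (q / 2^d) * c to the nearest integer (ties rounded up)."""
    return (2 * q * c + (1 << d)) // (1 << (d + 1))

# Compressing a decompressed value gives back the original d-bit value.
for d in (4, 10):
    assert all(compress(decompress(c, d), d) == c for c in range(1 << d))
```

The loop at the end checks the round-trip property for the two Kyber768 compression parameters d_v = 4 and d_u = 10.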

High-order computation of the Compress function
We provide the first description of the high-order computation of the Compress function of Kyber. Our technique can be seen as a generalization of the first-order technique of [FBR + 21], based on modulus switching: it consists in first using more precision, so that the error induced by the modulus switching can be completely eliminated, after a logical shift.
The Compress function is defined as:

$$\mathsf{Compress}_{q,d}(x) := \left\lfloor \frac{2^d}{q} \cdot x \right\rceil \bmod 2^d$$

We are given as input an arithmetic sharing x = x_1 + ... + x_n (mod q) and we want to compute a Boolean sharing of y = Compress_{q,d}(x) = y_1 ⊕ · · · ⊕ y_n ∈ {0, 1}^d. Note that in [BGR + 21], the authors only described the high-order masking of the Compress function with 1-bit output, which corresponds to the IND-CPA decryption function of Kyber. Here we high-order mask Compress for any number of output bits d (for example d = d_u = 10 or d = d_v = 4 in Kyber768). For the special case d = 1 there are more efficient techniques, see for example [BGR + 21] and [CGMZ22]. We proceed as follows. We first perform a modulus switching of the input coefficients x_i but with more precision; that is, we work modulo 2^{d+α} for some parameter α > 0 and compute:

$$z_1 = \left\lfloor \frac{x_1 \cdot 2^{d+\alpha}}{q} \right\rceil + 2^{\alpha-1} \bmod 2^{d+\alpha}, \qquad z_i = \left\lfloor \frac{x_i \cdot 2^{d+\alpha}}{q} \right\rceil \bmod 2^{d+\alpha} \quad \text{for } 2 \le i \le n$$

The rounding can be computed by writing:

$$\left\lfloor \frac{x_i \cdot 2^{d+\alpha}}{q} \right\rceil = \left\lfloor \frac{x_i \cdot 2^{d+\alpha+1} + q}{2q} \right\rfloor$$

which is the quotient of the Euclidean division of x_i · 2^{d+α+1} + q by 2q.
We then perform an arithmetic to Boolean conversion of the arithmetic shares z_1, . . . , z_n, followed by a logical shift by α bits. This can be done with complexity O((d + α) · n^2) using [CGV14]. By definition we obtain:

$$y_1 \oplus \cdots \oplus y_n = \left\lfloor \frac{(z_1 + \cdots + z_n) \bmod 2^{d+\alpha}}{2^{\alpha}} \right\rfloor$$

and eventually we output the Boolean shares y_1, . . . , y_n. We show below that we indeed have Compress_{q,d}(x) = y_1 ⊕ · · · ⊕ y_n as required, under the condition 2^α > q · n. This condition determines the number α of bits of precision as a function of the number of shares n. We provide the pseudocode in Algorithm 2 below.
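The arithmetic behind the modulus-switching step can be checked with a short Python sketch. This is only a functional reference, not a masked implementation: the arithmetic-to-Boolean conversion is replaced by a plain recombination of the shares, which a secure implementation must never do. The helper names and the placement of the rounding offset 2^(α−1) on the first share are assumptions made for this sketch.

```python
import secrets

Q = 3329

def compress(x, d, q=Q):
    # Unmasked reference: round((2^d / q) * x) mod 2^d.
    return ((x * (1 << (d + 1)) + q) // (2 * q)) % (1 << d)

def share_mod_q(x, n, q=Q):
    # Split x into n additive shares modulo q.
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % q)
    return shares

def ho_compress_reference(x_shares, d, q=Q):
    n = len(x_shares)
    alpha = (q * n).bit_length()          # smallest alpha with 2^alpha > q * n
    mod = 1 << (d + alpha)
    # Share-wise modulus switching from Z_q to Z_{2^(d+alpha)} with rounding.
    z = [((xi * (1 << (d + alpha + 1)) + q) // (2 * q)) % mod for xi in x_shares]
    # Assumed offset so that the final shift rounds instead of truncating.
    z[0] = (z[0] + (1 << (alpha - 1))) % mod
    # INSECURE stand-in for the masked arithmetic-to-Boolean conversion + shift.
    return (sum(z) % mod) >> alpha

x = 1234
assert ho_compress_reference(share_mod_q(x, 4), 10) == compress(x, 10)
```

Each share contributes a rounding error of at most 1/2, so the accumulated error is at most n/2; the condition 2^α > q · n keeps it too small to change the rounded result.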
Security. The following theorem shows that the HOCompress achieves the (n − 1)-NI property. The proof follows from the (n − 1)-NI property of the ArithmeticToBoolean algorithm, and the fact that the perfect simulation of z i requires the knowledge of the input x i only.
Polynomial comparison with Compress. Recall that we must perform the comparison c̃ =? Compress_{q,d}(x), where for simplicity we consider a single coefficient c̃. By applying the HOCompress algorithm, we obtain n Boolean shares such that c = c_1 ⊕ · · · ⊕ c_n. We must therefore zero-test the value (c_1 ⊕ c̃) ⊕ c_2 ⊕ · · · ⊕ c_n, which can be done using the ZeroTestBool algorithm from Appendix C.3.
For multiple coefficients, we apply the HOCompress algorithm separately on each coefficient x^(j) of the re-encrypted uncompressed ciphertext. We obtain the compressed ciphertext c masked with n Boolean shares. As previously, we xor each coefficient of the input ciphertext c̃ with the first share of the corresponding coefficient in c, and we apply the PolyZeroTestBool algorithm from Appendix D.1 to perform the comparison.

Polynomial comparison for Kyber without Compress
In this section we consider an alternative approach for ciphertext comparison, already used in [BGR + 21], that performs the comparison on uncompressed ciphertexts, in order to avoid the high-order computation of the Compress q,d (x) function as above. Under this approach, we describe a more efficient technique for ciphertext comparison for d = d u = 10, which is the case for 3/4 of the polynomial coefficients in Kyber.
For simplicity, we first consider a single polynomial coefficient. Given a compressed input ciphertext c̃ and an uncompressed re-encrypted ciphertext x, we must check that c̃ = Compress_{q,d}(x), where x is arithmetically masked with n shares modulo q. For this we use the equivalence:

$$\tilde{c} = \mathsf{Compress}_{q,d}(x) \iff x \in \mathsf{Compress}^{-1}_{q,d}(\tilde{c})$$

Given c̃ as input, we must therefore compute the (public) list of candidates Compress^{-1}_{q,d}(c̃), which corresponds to a certain interval in Z_q, and check whether x belongs to this interval. For this, the authors of [BGR + 21] describe a high-order algorithm that performs two high-order comparisons with the interval bounds. We recall the corresponding RangeTestShares algorithm in Appendix E.3, with the pseudo-code and the operation count.
However, we observe that when the number of candidates is small (which is the case for d = d_u = 10), it is more efficient to perform individual comparisons. More precisely, letting {x̃_1, . . . , x̃_m} = Compress^{-1}_{q,d}(c̃) be the list of candidates, we have, since q is prime:

$$\tilde{c} = \mathsf{Compress}_{q,d}(x) \iff \prod_{i=1}^{m} (x - \tilde{x}_i) = 0 \pmod{q}$$

Recall that x is arithmetically masked with n shares modulo q. Therefore we can high-order compute z = ∏_{i=1}^{m} (x − x̃_i) mod q and then apply a high-order zero-test of z modulo q. We describe the technique below.
Computing the set of candidates. Given a compressed coefficient c̃, we must compute the list of candidates Compress^{-1}_{q,d}(c̃). While such a preimage can be easily tabulated for all its 2^d possible inputs (as done in [BGR + 21]), in the following we describe a concrete algorithm. This can be useful for embedded applications with limited memory.
From [BDK + 18], we know that for any x ∈ Z_q such that c̃ = Compress_{q,d}(x), letting y = Decompress_{q,d}(c̃) = ⌊(q/2^d) · c̃⌉ we must have:

$$|x - y \bmod^{\pm} q| \le B_{q,d} := \left\lfloor \frac{q}{2^{d+1}} \right\rceil$$

Therefore the number of candidates is upper-bounded by 2B_{q,d} + 1; see Table 8 for the value of the upper-bound, and the maximum number of candidates, for q = 3329.

The following lemma shows that there are always at least 2B_{q,d} − 1 candidates around the decompressed value y, with possibly 2 additional candidates. We can then test these 2 candidates by applying the Compress function. Given 0 ≤ a < b < q, we denote by [a, b]_q the discrete interval {a, a + 1, . . . , b}; similarly given 0 ≤ b < a < q, we denote by [a, b]_q the discrete interval {a, a + 1, . . . , q − 1, 0, 1, . . . , b}. In Appendix E.1, we provide the proof of the following Lemma, from which we derive a concrete algorithm.
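The bounds on the number of candidates can be checked by brute force. The sketch below uses the tabulation approach (one pass over Z_q), not the concrete interval algorithm derived from the lemma; the names `bound_b` and `preimage_table` are hypothetical helpers introduced for illustration.

```python
Q = 3329

def compress(x, d, q=Q):
    # Unmasked reference: round((2^d / q) * x) mod 2^d.
    return ((x * (1 << (d + 1)) + q) // (2 * q)) % (1 << d)

def bound_b(d, q=Q):
    # B_{q,d} = round(q / 2^(d+1)), computed with integers only.
    return (2 * q + (1 << (d + 1))) // (1 << (d + 2))

def preimage_table(d, q=Q):
    # table[c] lists every x in Z_q with compress(x, d) == c.
    table = [[] for _ in range(1 << d)]
    for x in range(q):
        table[compress(x, d, q)].append(x)
    return table

for d in (4, 10):
    sizes = [len(cands) for cands in preimage_table(d)]
    b = bound_b(d)
    # Every compressed value has between 2B - 1 and 2B + 1 candidates.
    assert 2 * b - 1 <= min(sizes) and max(sizes) <= 2 * b + 1
```

Note that the candidate set can wrap around modulo q, which is why the interval notation [a, b]_q with b < a is needed.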
Individual comparisons. We see in Table 8 that the number of candidates is small for d = 10, 11, so we describe an alternative algorithm to the range test performed in [BGR + 21], based on individual comparisons. Letting {x̃_1, . . . , x̃_m} = Compress^{-1}_{q,d}(c̃) be the list of candidates, we must test whether x = x̃_i for some 1 ≤ i ≤ m. For this, given an arithmetically masked x with n shares with x = x_1 + · · · + x_n (mod q), we high-order compute the value

$$z = \prod_{i=1}^{m} (x - \tilde{x}_i) \bmod q$$

and we have that z = 0 (mod q) if and only if x = x̃_i for some 1 ≤ i ≤ m. We provide in Appendix E.2 the pseudocode description of the corresponding SecMultList algorithm, and its proof of security against t = n − 1 probes. As a second step, one can then apply a high-order zero-test of z modulo q, either the ZeroTestMult algorithm from Section 3.1, or the ZeroTestExpo algorithm from Section 3.2.
The above applies for a single coefficient x. In reality we must compare ℓ coefficients, so for each coefficient x^(j) whose compressed value must be compared to the coefficient c̃_j of the input ciphertext c̃, we compute the corresponding list of candidates from c̃_j, and then the corresponding arithmetically masked z^(j), which must all be equal to 0. Therefore a polynomial zero-test is applied to the set of arithmetically masked z^(j)'s modulo q, either the PolyZeroTestMult algorithm from Section 4.1, or the PolyZeroTestExpo algorithm from Section 4.2.
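The individual-comparison idea can be illustrated with the following Python sketch. It is not the SecMultList algorithm of Appendix E.2: the multiplication gadget is a generic ISW-style multiplication modulo q, written here only to show that the shares of z recombine to 0 exactly when x is one of the candidates. All function names are assumptions of this sketch, and no probing-security claim is made for it.

```python
import secrets

Q = 3329

def compress(x, d, q=Q):
    return ((x * (1 << (d + 1)) + q) // (2 * q)) % (1 << d)

def share_mod_q(x, n, q=Q):
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % q)
    return shares

def sec_mult(a, b, q=Q):
    # ISW-style product of two n-share arithmetic sharings modulo q.
    n = len(a)
    c = [(a[i] * b[i]) % q for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbelow(q)
            c[i] = (c[i] + r) % q
            c[j] = (c[j] + a[i] * b[j] + a[j] * b[i] - r) % q
    return c

def masked_candidate_product(x_shares, candidates, q=Q):
    # Returns shares of z = prod_i (x - candidate_i) mod q.
    z = [1] + [0] * (len(x_shares) - 1)      # trivial sharing of 1
    for cand in candidates:
        diff = list(x_shares)
        diff[0] = (diff[0] - cand) % q        # a public constant touches one share only
        z = sec_mult(z, diff, q)
    return z

c_tilde = 5
candidates = [x for x in range(Q) if compress(x, 10) == c_tilde]
z = masked_candidate_product(share_mod_q(candidates[0], 3), candidates)
assert sum(z) % Q == 0   # recombined only to check the sketch
```

Because q is prime, the product is nonzero modulo q whenever x differs from every candidate, so a single zero-test of z replaces the range test.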

Ciphertext comparison in Kyber: hybrid approach
We first compare in Table 9 the efficiency of the approaches with and without Compress. Since the number of candidates is small for the coefficients with compression parameter d u = 10 bits, without using the Compress function we can use either the RangeTestShares algorithm from [BGR + 21], or our SecMultList algorithm. We see in Table 9 that the latter is significantly faster. It is also faster than applying the Compress function with our HOCompress algorithm. On the other hand, for the coefficients with compression to d v = 4 bits, without using the Compress function, one must use the RangeTestShares algorithm from [BGR + 21]. But we see that our HOCompress is nevertheless faster. The reason is that it uses only a single arithmetic to Boolean conversion with a power-of-two modulus, whereas RangeTestShares uses two arithmetic to Boolean conversions, moreover modulo q, which is more costly than with power-of-two moduli.
In summary, from Table 9, we deduce that for d = d_u = 10, our SecMultList approach without Compress is faster, while for d = d_v = 4, our HOCompress algorithm is faster. Therefore, to perform the ciphertext comparison in Kyber, we use a hybrid approach, applying the Compress function only for the last ℓ_2 = 256 coefficients of the ciphertext, for which d = d_v = 4.

Procedure for ciphertext comparison. We summarize our hybrid approach below. Recall that for masking the IND-CCA decryption of Kyber, we must perform a comparison between the unmasked input ciphertext c̃, and the masked re-encrypted ciphertext c. Moreover, with the Kyber768 parameters, a ciphertext consists of 4 polynomials with 256 coefficients each. The coefficients of the first 3 polynomials are compressed with d_u = 10 bits, while the coefficients of the last polynomial are compressed with d_v = 4 bits. Starting from the re-encrypted uncompressed ciphertext c_u which is masked modulo q, and given the input ciphertext c̃, we proceed as follows:

1. For each of the first ℓ_1 = 768 coefficients of c_u, with compression parameter d_u = 10, we use the individual comparison technique from Section 5.2 (Algorithm SecMultList). We obtain a set of values z^(j) arithmetically masked modulo q, that must all be equal to 0, for 1 ≤ j ≤ ℓ_1.
2. For each of the last ℓ_2 = 256 coefficients of c_u, we apply the HOCompress algorithm with d_v = 4 bits. We obtain a set of ℓ_2 coefficients c^(j) for 1 ≤ j ≤ ℓ_2, which are Boolean masked with n shares.
3. We xor each of the last ℓ_2 coefficients of the input ciphertext c̃ to the first Boolean share of each of the corresponding ℓ_2 coefficients c^(j). This gives a vector of ℓ_2 coefficients x^(j) for 1 ≤ j ≤ ℓ_2, which are Boolean masked with n shares, and that must all be equal to 0.
4. We apply the PolyZeroTestBool algorithm (Alg. 21) to the set of ℓ_2 coefficients x^(j), but without recombining the shares at the end of the ZeroTestBoolLog algorithm. More precisely, we obtain Boolean shares b_i for 1 ≤ i ≤ n, with b′ = b_1 ⊕ · · · ⊕ b_n and b′ = 1 if the ℓ_2 coefficients x^(j) are zero, and b′ = 0 otherwise.
5. We take the complement of b′ by taking the complement of b_1, and convert the result from Boolean to arithmetic masking modulo q, using the table-based algorithm from [CGMZ22]. We obtain an additional coefficient z^(ℓ_1+1) arithmetically masked modulo q, and that must be equal to 0.
6. Finally, we perform a zero-test of the ℓ_1 + 1 coefficients z^(i) for 1 ≤ i ≤ ℓ_1 + 1, using either the PolyZeroTestExpo or the PolyZeroTestMult algorithm. We obtain a bit b = 1 if the two ciphertexts are equal, and b = 0 otherwise, as required.

Operation count and concrete running time
We provide in Table 10 a comparison of the operation count for the ciphertext comparison in Kyber, first using the approach from [BGR + 21] without Compress, and then our hybrid approach described in the previous section, with either the PolyZeroTestExpo or the PolyZeroTestMult methods. We see that the hybrid approach is significantly faster, especially for high security orders. We have also performed a C implementation that confirms these results, see Table 11 below.

The three parameter sets share the common parameters N = 256, q = 3329 and η_2 = 2, while the security level is defined by setting the module rank k = 2, 3, 4, and the parameters η_1, d_t, d_u and d_v (see Table 7). We refer to Appendix F for an overview of ring-LWE encryption [LPR10]. In the following, we first recall the definition of the Kyber scheme. We then describe the evaluation of the Kyber decapsulation mechanism, secure at any order, using the techniques from the previous sections.

For a module rank k, we use a public random k × k matrix A with elements in R_q. We set χ_η as the centered binomial distribution with support {−η, . . . , η}, and extended to the distribution of polynomials of degree N with entries independently sampled from χ_η. The public-key is t = A · s + e ∈ R_q^k and the secret key is s, where s, e ← χ^k_{η_1} for some parameter η_1. To CPA-encrypt a message m ∈ R with binary coefficients, one computes (c_1, c_2) ∈ R_q^k × R_q such that c_1 = A^T · r + e_1 and c_2 = t^T · r + e_2 + ⌊q/2⌉ · m, where r ← χ^k_{η_1}, e_1 ← χ^k_{η_2} and e_2 ← χ_{η_2}, for some parameter η_2. To decrypt a ciphertext (c_1, c_2), one computes:

$$m = \mathsf{th}(c_2 - s^T \cdot c_1)$$

where th denotes the coefficient-wise threshold function, which corresponds to Compress_{q,1}.

Polynomial comparison
Kyber instantiates the M-LWE-based encryption scheme above with N = 256 and a prime q = 3329; see Table 7 for the other parameters. We recall the pseudo-code from [BDK + 18] below. For simplicity we omit the NTT transform for fast polynomial multiplication. The NTT is indeed a linear operation, so it is easily masked with arithmetic masking modulo q.
Note that the Kyber.Decaps algorithm does not output ⊥ for invalid ciphertexts, as originally in the FO transform. Instead, it outputs a pseudo-random value from the hash of a secret seed z and the ciphertext c. This variant of the FO transform was proven secure in [HHK17]. However, the variant remains secure even if the adversary is given the result of the ciphertext comparison, under the condition that the IND-CPA scheme is γ-spread, which essentially means that ciphertexts have sufficiently large entropy (see [HHK17]), which is the case in Kyber. Therefore, in the high-order masking of Kyber, the bit b of the comparison can be computed in the clear (as in [BGR + 21]), because for the simulation of the probes the bit b can be given for free to the simulator.

High-order masking of Kyber
We describe the high-order masking of the Kyber.Decaps algorithm recalled above (Algorithm 7), using the techniques from the previous sections.

1. We consider Line 1 of Algorithm 7, with the IND-CPA decryption as the first step. We assume that the secret key s ∈ R^k is initially masked with n shares, with s = s_1 + · · · + s_n (mod q), where s_i ∈ (R_q)^k for all 1 ≤ i ≤ n. Therefore, at Line 3 of the Kyber.CPA.Dec algorithm, we obtain a value v − s^T u that is arithmetically n-shared modulo q. We must therefore compute the Compress_{q,1} function on this value. For this we use the modulus switching and table recomputation technique from [CGMZ22], which outputs a Boolean masked message m′ = m_1 ⊕ · · · ⊕ m_n = th(v − s^T u).
2. At Line 2 of Algorithm 7, starting from the Boolean masked m′, we use an n-shared Boolean implementation of the hash function G, and obtain as output the Boolean n-shared values K̂′ and r′.
3. At Line 3 of Algorithm 7, we start with Line 4 of Algorithm 5 which is the masked binomial sampling. Starting from the Boolean n-shared r′, we must obtain values r, e_1 and e_2 which are arithmetically n-shared modulo q. For this we use the n-shared binomial sampling from [CGMZ22], based on Boolean to arithmetic modulo q conversion (based on table recomputation). We use the random generation modulo q described in Appendix A.
4. We proceed with Lines 5 and 6 of Algorithm 5. We obtain the values A^T · r + e_1 and t^T · r + e_2 + ⌊q/2⌉ · m arithmetically n-shared modulo q. In particular, the n-shared value ⌊q/2⌉ · m is obtained using the table-based Boolean to arithmetic modulo q conversion from [CGMZ22].
5. At Line 6 of Algorithm 5, the n-shared value t^T · r + e_2 + ⌊q/2⌉ · m is high-order compressed into v′ using the HOCompress algorithm from Section 5.1. The value v′ is therefore Boolean n-shared in {0, 1}^{d_v}. On the other hand, the vector u′ at Line 5 is left uncompressed.
6. For the ciphertext comparison at Line 4 of Algorithm 7, we use the hybrid technique from Section 5.3 with the arithmetically masked modulo q uncompressed vector u′, and the Boolean masked compressed value v′. We obtain a bit b in the clear.
7. Finally, if b = 1, we use the Boolean n-shared K̂′ to obtain a Boolean n-shared session key K, using an n-shared implementation of H. Similarly, if b = 0, we use the Boolean n-shared secret z to obtain the Boolean n-shared session key K.

Fully masked implementation of Saber
Saber [BMD + 21] is based on the hardness of the module learning-with-rounding (M-LWR) problem. The difference with Kyber is that instead of explicitly adding error terms e, e_1, e_2 from a "small" distribution, the errors are deterministically added by applying a rounding function mapping Z_q to Z_p with p < q. For Saber, both p and q are powers of two; therefore the rounding function is a shift extracting the log_2(p) most significant bits of its input. We provide in Appendix G the description of a fully masked implementation of Saber, as we did for Kyber in the previous section.

Implementation of Kyber and Saber
We have implemented in C a high-order version of the algorithm Kyber.Decaps, following the description of Section 6.2. For comparison, we have also performed a high-order implementation of the Saber algorithm. We refer to Appendix G.2 for a description. For both schemes, we have targeted the parameter set corresponding to NIST security category 3 (parameter Kyber768 for Kyber, see Table 7). We have run our implementation on a laptop and an embedded component. We provide the source code of the laptop implementation at:

https://github.com/fragerar/HOTableConv/tree/main/Masked_KEMs
For the embedded component, we have used a 100 MHz ARM Cortex-M3 architecture with 48k of RAM, which also includes a hardware accelerator for secure 32-bit random generation. Such a component is used in real-life products like bank cards, passports, secure elements, etc. The embedded code is almost the same as for the laptop implementation, except for the random generation which uses the hardware accelerator. We have also performed some RAM optimization in order to reduce the number of temporary variables without changing the number of operations.
Kyber. Our high-order implementation of Kyber.Decaps follows the description from Section 6.2. To generate random integers modulo q, we use the technique described in Algorithm 8 (see Appendix A), starting from a 32-bit random number generator. The timings are summarized in Table 12. For the embedded implementation, we can reach at most a security order of 3, due to RAM limitation.

Saber. Our high-order implementation of Saber.Decaps follows the description from Appendix G.2. As for Kyber, the embedded implementation can reach at most a security order of 3, due to RAM limitation. The timings are summarized in Table 13.

We see that for both Kyber and Saber the performance gap between the unmasked and the order 1 versions is fairly large. This is because we have used generic gadgets only, with no optimization at order 1. In practice, for first-order security, a significantly lower penalty factor could be obtained via some optimizations. In particular, all techniques based on table recomputation are much more efficient at order 1, since in that case the table can be randomized once and read multiple times.

Side-channel resistance evaluation
In order to validate our masking scheme beyond the theoretical framework, we have used the popular ChipWhisperer-Lite platform to get power traces of the execution on a Cortex-M4 core (STM32F303) of the basic zero-test gadgets from Section 3. The reason for performing benchmarks and experiments on different platforms (Intel, Cortex-M3, and Cortex-M4) is the following. Performing benchmarks on x86 makes it possible to reach much higher security orders than on embedded implementations, which are limited by RAM. For the embedded component, we have used a Cortex-M3 component since it is widely used in real-life products (bank cards, passports, etc). For the side-channel evaluation we have used the ChipWhisperer-Lite which includes a Cortex-M4 by default.
For the leakage assessment on the ChipWhisperer-Lite platform, we have rewritten the gadgets specifically at order 1 in ARM assembly, in order to limit potential side-channel unsafe modifications from the compiler. We have conducted a fixed versus random t-test using the methodology described in [SM15]. The idea is to perform several measurements of the power consumption while the device under attack is executing the targeted gadget either with a fixed secret value chosen beforehand or with a random value sampled before each measurement. One keeps track of which measurement is using the fixed value, and thus one creates two sets of traces corresponding to the fixed and the random values respectively. The t-test is then used as a distinguisher between the two sets at each point in the power traces. If the values output by the t-test are high, it means that the two sets seem statistically different and that an adversary can potentially use this information to learn something about the secret value. In practice, we have used a set of 10 000 traces for each gadget. For each trace, a coin was flipped to determine whether the random or the fixed secret value should be used.
The results of the t-test can be found in Figures 1, 2 and 3. We see that when the RNG is switched off, the random and fixed inputs are distinguishable as the t-values are well above the usual threshold |t| > 4.5. When the random number generator is switched on, values are properly masked and the test is successful on all the zero-test gadgets.
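The fixed versus random test described above can be sketched in a few lines of Python. This is a generic Welch t-test applied at each sample point, written with the standard library only; it is not the authors' measurement tooling, and the function names and the toy traces are invented for illustration.

```python
from math import sqrt, inf
from statistics import mean, variance

def welch_t(fixed, rand):
    """Welch's t statistic for two lists of samples taken at one time point."""
    diff = mean(fixed) - mean(rand)
    denom = sqrt(variance(fixed) / len(fixed) + variance(rand) / len(rand))
    if denom == 0:
        return 0.0 if diff == 0 else inf
    return diff / denom

def t_trace(fixed_traces, random_traces):
    # One t value per sample point; every trace must have the same length.
    points = range(len(fixed_traces[0]))
    return [welch_t([t[p] for t in fixed_traces],
                    [t[p] for t in random_traces]) for p in points]

def leaks(fixed_traces, random_traces, threshold=4.5):
    return any(abs(t) > threshold for t in t_trace(fixed_traces, random_traces))

# Toy data: point 0 behaves identically in both sets, point 1 depends on the input.
fixed_set = [[0.0, 1.0], [0.1, 1.1], [-0.1, 0.9], [0.0, 1.0]]
random_set = [[0.0, 5.0], [0.1, 5.1], [-0.1, 4.9], [0.0, 5.0]]
print(leaks(fixed_set, random_set))
```

In a real evaluation the two sets would each contain thousands of measured traces, and the whole curve of t values would be plotted against the ±4.5 threshold.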
In practice, obtaining these results was not trivial. Our first attempt was to directly take measurements using the C code used for the benchmarks in the previous section. Unfortunately, the traces were not leakage-free. We suspected that the compiler was generating assembly code that handled shares in an unsecured manner. To solve this issue, we decided to rewrite the gadgets directly in ARM assembly to obtain a full control over the instructions executed by the device. This greatly improved the situation, but in some cases, we still had some unexplained leakage due to the micro-architecture. More specifically, it appeared that manipulating the shares one after the other was a bad practice and would create leakage between instructions. This can be fixed by either rethinking the code in order to create some space or adding some dummy instructions.

Conclusion
In this paper, we have described efficient techniques for high-order polynomial comparison, as used in lattice-based schemes with the Fujisaki-Okamoto transform. As an application, we have considered the high-order polynomial comparison in the NIST encryption standard Kyber. We have provided the first high-order description of the Compress function in Kyber, in order to perform the comparison on compressed ciphertexts. We have shown that the best approach is actually hybrid, with the Compress function being applied only on the last part of the ciphertext, while the rest is left uncompressed for the comparison. Finally, we have provided a complete description of the high-order masking of the IND-CCA decryption of the Kyber scheme at any order, with the practical results of a C implementation, and a t-test evaluation.

A Random generation modulo q
To generate a random integer in Z_q, one can generate a random k-bit integer where the gap 2^k − i · q is small for some i. By rejection sampling, one obtains a uniformly distributed integer in [0, i · q), from which we obtain a uniformly distributed integer modulo q, with rejection probability 1 − i · q/2^k. For example, with q = 3329, one can take k = 16 and i = 19, with rejection probability 0.035. We can also use the trick described in [Lum13, Section 3]. It consists in generating a random integer modulo q^2, from which one can extract two random integers modulo q; we can of course use higher powers of q. As previously, we generate a random k-bit integer such that the gap 2^k − i · q^2 is small. The rejection probability is then 1 − i · q^2/2^k. For example, with q = 3329, we can use k = 25 and i = 3, and the rejection probability is 0.009, so we are using 12.5 bits per random integer modulo q, with rejection probability 0.0046 per random integer. We can also use k = 32 and i = 387, which gives 16 bits per random integer as previously, but with rejection probability 0.0007 per random integer (instead of 0.035, so a factor 50 improvement in rejection rate). We describe the pseudo-code below, to be run with parameters (i, k, q) = (387, 32, 3329).
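The rejection-sampling procedure described above can be sketched in Python as follows. The use of the standard `secrets` module as the source of k-bit randomness, and the function names, are assumptions of this sketch; an embedded implementation would draw from its hardware generator instead.

```python
import secrets

def random_mod_q_pair(i=387, k=32, q=3329):
    """Return two uniform integers modulo q from one accepted k-bit draw."""
    limit = i * q * q
    assert limit < (1 << k)
    while True:
        r = secrets.randbits(k)        # uniform k-bit integer
        if r < limit:                  # accept only values below i * q^2
            break
    r %= q * q                         # uniform modulo q^2
    return r % q, r // q               # two independent values modulo q

def rejection_probability(i, k, modulus):
    # Probability that a single k-bit draw is rejected.
    return 1 - i * modulus / (1 << k)

r1, r2 = random_mod_q_pair()
print(r1, r2, rejection_probability(387, 32, 3329 ** 2))
```

The helper `rejection_probability` reproduces the figures quoted in the text: about 0.035 for (i, k) = (19, 16) modulo q, and about 0.0014 per draw (0.0007 per output integer) for (i, k) = (387, 32) modulo q^2.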

Algorithm 8 randomModq
Input: Parameters i, k, q such that i · q^2 < 2^k.
Output: r_1, r_2 uniformly distributed in Z_q.
r := 2^k
while r ≥ i · q^2 do
    r ← {0, 1}^k
end while
r := r mod q^2
return (r mod q, ⌊r/q⌋)

B Share recombination

B.1 Share recombination with n linear mask refreshing

When performing the comparison between two ciphertexts c and c′, the output bit b of the comparison must eventually be computed in the clear, which means that the n shares b_i of b must eventually be recombined. We first recall the approach used in [Cor14] for recombining the shares when outputting the AES ciphertext. It consists in performing a sequence of n mask refreshings, each of complexity O(n), so that the share recombination can be perfectly simulated, knowing the output bit b. Namely, this output bit b of the comparison can be given for free to the simulator, since that bit b is eventually known by the adversary. Following [Cor14], this makes it possible to prove the (n − 1)-NIo property of the share recombination algorithm, according to Definition 3 from [BBE + 18]. In Appendix B.2, we describe a slightly more efficient approach, still with complexity O(n^2), but using only half the randomness.
We first recall the LinearRefreshMasks algorithm from [RP10], working in any finite field F.
Lemma 2. The RecombineShares algorithm is (n − 1)-NIo, when y is given to the simulator.
Proof. Since the adversary has at most t < n probes, at least one of the LinearRefreshMasks has not been probed. Let 1 ≤ i ⋆ ≤ n be the corresponding index. Any probe for i < i ⋆ can be simulated by including some index j in a set I, initially empty. Given the knowledge of y, the n outputs of the LinearRefreshMasks at index i ⋆ can be perfectly simulated, simply by generating random y 1 , . . . , y n such that y = y 1 + · · · + y n . Therefore, any probe after the index i ⋆ can be perfectly simulated. Finally, any probe can be perfectly simulated from the knowledge of y and the input y |I , with |I| ≤ t. ⊓ ⊔

B.2 Share recombination: a more efficient approach
In the following, we describe a variant of the above share recombination technique, using only half the randomness. For this, we use the following extension of the t-SNI security notion recently introduced in [CS21], which was used to prove the security of the ISW construction in the stateful model. Under this definition called free-SNI, all output variables except one can always be perfectly simulated (which is not necessarily the case in the original SNI definition). Moreover it was shown in [CS21] that the RefreshMasks algorithm (which we recall in Appendix C.2) satisfies the extended notion.
Definition 4 (Free-t-SNI security). Let G be a gadget taking as input n shares (a i ) 1≤i≤n and outputting n shares (b i ) 1≤i≤n . The gadget G is said to be free t-SNI secure if for any set of t 1 ≤ t probed intermediate variables, there exists a subset I of input indices with |I| ≤ t 1 , such that the t 1 intermediate variables and the output variables b |I can be perfectly simulated from a |I , while for any O ⊊ [1, n] \ I the output variables in b |O are uniformly and independently distributed, conditioned on the probed variables and b |I .
Thanks to the free-SNI definition, we can now simulate all output variables, if the simulator is given the value encoded by those output variables, that is b = b 1 + · · · + b n . We can then recombine the output shares of the gadget, and all intermediate variables in the recombination can be perfectly simulated.
In particular, assume that we must recombine the shares a 1 , . . . , a n . For this, we first compute (b 1 , . . . , b n ) ← RefreshMasks(a 1 , . . . , a n ) and eventually compute b = b 1 +· · ·+b n , which gives b = a 1 +· · ·+a n as required. The advantage of using RefreshMasks instead of n LinearRefreshMasks as in the previous section, is that we use n(n − 1)/2 random values instead of n(n − 1).
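The randomness count can be illustrated with a short Python sketch. The pairwise form of `refresh_masks` below is an assumption: it is one common way to write a full mask refreshing with n(n − 1)/2 random values, and the paper's own RefreshMasks (Appendix C.2) is not reproduced in this section. The sketch shows functionality only and makes no security claim.

```python
import secrets

def refresh_masks(shares, q):
    """Pairwise refresh of an additive sharing modulo q; returns (shares, #randoms)."""
    out = list(shares)
    n = len(out)
    randoms_used = 0
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbelow(q)
            out[i] = (out[i] + r) % q     # add r to one share
            out[j] = (out[j] - r) % q     # remove it from another, keeping the sum
            randoms_used += 1
    return out, randoms_used

def recombine(shares, q):
    # Refresh first, then add the refreshed shares one by one.
    refreshed, _ = refresh_masks(shares, q)
    b = 0
    for s in refreshed:
        b = (b + s) % q
    return b

print(recombine([1, 2, 3, 4], 3329))      # the encoded value is 1 + 2 + 3 + 4 = 10
```

For n = 5 shares this refresh draws 10 random values, against 5 · 4 = 20 for five successive linear refreshings.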
Lemma 4. Let G be a gadget taking as input n shares (a i ) 1≤i≤n and outputting n shares (b i ) 1≤i≤n . Assume that G satisfies the free-t-SNI property. Then for any set of t 1 ≤ t intermediate variables, there exists a subset I of input indices with |I| ≤ t 1 , such that the t 1 intermediate variables and all output variables (b i ) 1≤i≤n can be perfectly simulated from a |I and b = b 1 + · · · + b n .
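Proof. We use the set I obtained from Definition 4. If |I| = n, we can simulate all output variables. Otherwise, let i⋆ ∉ I. We let O = [1, n] \ (I ∪ {i⋆}), so that O ⊊ [1, n] \ I. By Definition 4, the output variables b_{|O} are uniformly and independently distributed, conditioned on the probed variables and b_{|I}, so they can be perfectly simulated by sampling uniform values. The remaining output b_{i⋆} is then determined by the relation b = b_1 + · · · + b_n. ⊓ ⊔

From Lemma 4, all output variables of a free-SNI gadget can be simulated using the knowledge of the encoded value b. Therefore, all intermediate variables in the subsequent recombination can be perfectly simulated. This shows that the resulting gadget satisfies the NI property. This makes it possible to prove the probing security of the full circuit by composition.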
Corollary 1 (NI security). Let G be a gadget taking as input a sequence of n shares (a i ) 1≤i≤n and outputting n shares (b i ) 1≤i≤n , and let G ′ be the same as G but outputting b = b 1 + · · · + b n . Assume that G satisfies the free-t-SNI property. Then for any set of t 1 ≤ t intermediate variables of gadget G ′ , there exists a subset I of input indices with |I| ≤ t 1 , such that the t 1 intermediate variables can be perfectly simulated from a |I and b.
We stress that in Lemma 4, even if t_1 = n − 1 of the output variables are probed and the output bit b is known, only t_1 = n − 1 input variables a_i of the gadget must be known, which means that only n − 1 output variables of the preceding gadget are required for the simulation. In contrast, without a free-SNI gadget such as RefreshMasks, all n input shares must be known for the recombination, which prevents proving the probing security of the full circuit.

C.1 The SecAnd algorithm
We first recall the SecAnd algorithm that computes the And between two Boolean masked values with n shares. The algorithm is a variant with k-bit words of the original ISW algorithm. The algorithm has complexity O(n^2), with a number of operations :

Lemma 5 ([BBD + 16]). The SecAnd algorithm is (n − 1)-SNI.
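As an illustration of the gadget just recalled, here is a minimal Python sketch of an ISW-style And on k-bit Boolean shares. It follows the textbook ISW structure and is not copied from the paper's pseudocode; the helper names are hypothetical, and Python gives no control over leakage, so the sketch only demonstrates correctness of the share arithmetic.

```python
import secrets

def bool_share(x, n, k):
    # Split a k-bit value into n Boolean shares.
    shares = [secrets.randbits(k) for _ in range(n - 1)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]

def xor_all(shares):
    acc = 0
    for s in shares:
        acc ^= s
    return acc

def sec_and(a, b, k):
    """ISW-style And of two n-share Boolean sharings of k-bit words."""
    n = len(a)
    c = [a[i] & b[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(k)
            # The bracketing matters in a real implementation: r is added first.
            r_ji = (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
            c[i] ^= r
            c[j] ^= r_ji
    return c

x, y, k = 0b1011001110001, 0b1110001111011, 13
out = sec_and(bool_share(x, 4, k), bool_share(y, 4, k), k)
assert xor_all(out) == x & y   # recombined only to check the sketch
```

The double loop draws n(n − 1)/2 random k-bit words, which is where the O(n^2) complexity comes from.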

C.2 Mask refreshing
We recall the RefreshMasks algorithm, where the operations are performed in any group G, for example the additive group Z q for any integer q.

C.3 Boolean zero testing in {0, 1} k
We consider the zero-testing of x ∈ {0, 1}^k from its Boolean shares. We consider the k bits of x = x^(k−1) · · · x^(0). The zero-testing of x computes a bit b with b = 1 if x = 0, and b = 0 otherwise; therefore:

$$b = \overline{x^{(k-1)}} \wedge \cdots \wedge \overline{x^{(0)}}$$

Starting from the n Boolean shares of x = x_1 ⊕ · · · ⊕ x_n, the right-hand side of the above equation can be computed by a sequence of k − 1 secure And (see Appendix C.1). For simplicity we actually perform k iterations of SecAnd, the first one being a SecAnd with encoded input 1, to avoid an explicit mask refreshing at the beginning. The shares b_1, . . . , b_n are eventually recombined using the RecombineShares algorithm (see Appendix B.1). We obtain Algorithm 13 below.
Alternatively, at Line 6 one can replace the RecombineShares algorithm by the RefreshMasks algorithm, followed by the computation of b = b 1 ⊕ · · · ⊕ b n . In that case, the ZeroTestBool algorithm up to RefreshMasks is free-(n − 1)-SNI. From Corollary 1, this implies that the full ZeroTestBool is (n − 1)-NI, when b is given to the simulator.

The advantage of using RefreshMasks instead of RecombineShares is that we use n(n − 1)/2 random values instead of n(n − 1). Therefore, in the rest of the paper, we will use this latter approach.
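The bit-by-bit zero-test of this section, in its RefreshMasks variant, can be sketched as follows. This is a functional illustration under assumptions, not the paper's Algorithm 13: the names are hypothetical, each bit is complemented by flipping a single share, and the accumulator starts from the trivial encoding (1, 0, . . . , 0) of the bit 1.

```python
import secrets
from functools import reduce
from operator import xor

def share(v, n, k):
    s = [secrets.randbits(k) for _ in range(n - 1)]
    return s + [reduce(xor, s, v)]

def sec_and(x, y, k):
    n = len(x)
    z = [x[i] & y[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(k)
            r2 = (r ^ (x[i] & y[j])) ^ (x[j] & y[i])
            z[i] ^= r
            z[j] ^= r2
    return z

def refresh_masks_bool(a, k):
    b = list(a)
    for i in range(len(b)):
        for j in range(i + 1, len(b)):
            r = secrets.randbits(k)
            b[i] ^= r
            b[j] ^= r
    return b

def zero_test_bool(shares, k):
    """Return 1 if the xor of the k-bit shares is 0, and 0 otherwise."""
    n = len(shares)
    b = [1] + [0] * (n - 1)              # encoded input 1
    for pos in range(k):                  # k iterations of SecAnd
        bits = [(s >> pos) & 1 for s in shares]
        bits[0] ^= 1                      # shares of the complemented bit
        b = sec_and(b, bits, 1)
    b = refresh_masks_bool(b, 1)          # refresh before recombining
    return reduce(xor, b, 0)
```

Only the final bit b is ever recombined; the individual bits of x remain in shared form throughout.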

C.4 Boolean zero-test with complexity O(n 2 · log k)
For a k-bit input x, the previous algorithm has complexity O(n 2 ·k). In the following, we describe an improved algorithm ZeroTestBoolLog with complexity O(n 2 · log k), by taking advantage of the And operations on k-bit registers, instead of single bits. We also provide a more precise operation count.
Let x ∈ {0, 1} k and let x = x 1 ⊕ · · · ⊕ x n be a Boolean sharing of x. We describe a procedure to zero-test x in O(n 2 · log k) operations on k-bit registers, instead of O(n 2 · k) with the previous approach. The technique is as follows. We write the k bits of x. Let m = ⌈log 2 k⌉. If k is not a power of two, then we set the most significant bits of x to 1 until the next power of two, which is 2 m . Let f i (x) = x ∧ (x ≫ 2 i ). We prove below that we have: Therefore to zero-test x, we can compute: We describe in Algorithm 14 below the high-order computation of the previous equation with n shares, using the SecAnd and RefreshMasks algorithms from sections C.1 and C.2.
Indeed, we have By using the recurrence hypothesis in Equation (8), we get which terminates the recursive proof.
Eventually, the result also holds for x = x 1 ⊕ · · · ⊕ x n with n > 1, since the same operations are performed on all shares, which proves the theorem.
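The folding with f i (x) = x ∧ (x ≫ 2 i ) can be checked on unmasked values with the short Python sketch below. It relies on an assumption, since the displayed equations are not reproduced here: the fold is applied to the bitwise complement of x, padded with 1 bits up to 2 m bits, so that the least significant bit of the result is the And of all complemented bits. The name `zero_test_log` is hypothetical.

```python
def zero_test_log(x, k):
    """Unmasked illustration: return 1 if the k-bit value x is 0, else 0."""
    m = (k - 1).bit_length()                 # m = ceil(log2 k)
    width = 1 << m                           # next power of two, 2^m
    low_mask = (1 << k) - 1
    y = (~x) & low_mask                      # complement the k bits of x
    y |= ((1 << width) - 1) ^ low_mask       # set the padding bits to 1
    for i in range(m):
        y &= y >> (1 << i)                   # f_i(y) = y AND (y >> 2^i)
    return y & 1                             # bit 0 = And of all 2^m bits
```

After step i, bit 0 of y holds the And of the 2 i+1 lowest bits of the padded complement, so m = ⌈log 2 k⌉ steps suffice instead of k − 1 single-bit And operations.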

Proof. It is easy to see that the composition of steps 5 and 6 is (n − 1)-SNI secure, from the (n − 1)-SNI security of SecAnd and RefreshMasks. Therefore, the ZeroTestBoolLog algorithm up to Line 7 satisfies the (n − 1)-SNI property, and therefore the full algorithm is (n − 1)-NIo when b is given to the simulator. ⊓ ⊔

C.5 Zero testing modulo q via arithmetic to Boolean conversion
We now consider the zero-testing of an element x ∈ Z q from its arithmetic shares. Given as input the n arithmetic shares of x = x 1 + · · · + x n mod q, we must output a bit b, with b = 1 if x = 0 and b = 0 if x ≠ 0, without revealing more information about x. For q ≤ 2 k , we first perform an arithmetic to Boolean conversion, which gives the Boolean shares y 1 , . . . , y n ∈ {0, 1} k , with x = y 1 ⊕ · · · ⊕ y n . We then apply the Boolean zero-testing algorithm from the previous section. We obtain the pseudo-code below.

Algorithm 15 ZeroTestAB
Input: $q \in \mathbb{Z}$, $k \in \mathbb{Z}$ with $q \le 2^k$, and $x_1, \ldots, x_n \in \mathbb{Z}_q$
Output: $b \in \{0,1\}$ with $b = 1$ if $\sum_i x_i = 0 \pmod q$ and $b = 0$ otherwise
1: $(y_1, \ldots, y_n) \leftarrow$ ArithmeticToBoolean$(q, (x_1, \ldots, x_n))$
2: return ZeroTestBoolLog$(k, (y_1, \ldots, y_n))$

The arithmetic to Boolean conversion step has complexity O(n 2 ·k) for q = 2 k , using [CGV14] or the table recomputation approach from [CGMZ22]. The technique actually works for arithmetic masking modulo any integer q, since we can use [BBE + 18,SPOG19] to convert from arithmetic modulo q to Boolean masking, with complexity O(n 2 · k) for a k-bit modulus q. In the second step, one can use the improved algorithm ZeroTestBoolLog from Appendix C.4 with complexity O(n 2 · log k). Therefore the overall complexity is O(n 2 · k), where k = ⌈log 2 q⌉, with a number of operations: where T AB (k, n) is the complexity of the arithmetic to Boolean conversion for a k-bit modulus q.
Proof. The result follows from Theorem 8, with the ArithmeticToBoolean algorithm which is assumed to be (n − 1)-NI.

C.6 Zero testing modulo 2 k for small k via table recomputation
The technique is a direct application of the table-based conversion algorithm from [CGMZ22]. Namely the table recomputation from [CGMZ22] can high-order compute any function f : G → H, for any groups G and H. So it suffices to take G = Z q and H = {0, 1}, with f (x) = 1 if x = 0 (mod q) and f (x) = 0 otherwise. The technique has complexity O(q · n 2 ), which can be prohibitive for large q. For q = 2 k and small k, we can use the register optimization from [CGMZ22]. In that case the countermeasure has complexity O(n 2 ) only, assuming that we have access to 2 k -bit registers. Therefore this optimization can only work for small k, say up to k = 8.
More precisely, the technique first initializes a table T with q rows, where for 0 ≤ i < q the i-th row contains an n-shared Boolean encoding of 1 for i = 0, and 0 otherwise:
end for
9: end for
10: (b 1 , . . . , b n
Proof. From [CGMZ22], the ZeroTestTable algorithm up to Line 10 is (n − 1)-SNI. Thanks to the last RefreshMasks, it is actually free-(n − 1)-SNI. Therefore the full algorithm is (n − 1)-NIo when the output b is given to the simulator.

C.7 Secure Multiplication modulo q
We recall hereafter the SecMult algorithm as already considered in [SPOG19]. Note that the number of operations of SecMult is n · (7n − 5)/2 by considering random generation in Z q , addition and multiplication modulo q as a single operation.
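A minimal Python sketch of such a masked multiplication modulo q is given below. It follows the usual ISW structure with arithmetic shares and is meant as an illustration only; the names `sec_mult` and `share_mod` are hypothetical, and the per-pair cost in the comment is an interpretation consistent with the stated count n · (7n − 5)/2.

```python
import secrets

def share_mod(v, n, q):
    """Split v into n arithmetic shares modulo q (hypothetical helper)."""
    s = [secrets.randbelow(q) for _ in range(n - 1)]
    return s + [(v - sum(s)) % q]

def sec_mult(x, y, q):
    """Masked product: the shares z satisfy sum(z) = sum(x) * sum(y) mod q."""
    n = len(x)
    z = [(x[i] * y[i]) % q for i in range(n)]        # n multiplications
    for i in range(n):
        for j in range(i + 1, n):
            # Per pair: 1 random, 2 multiplications, 2 additions,
            # 1 subtraction, 1 addition, i.e. 7 operations.
            r = secrets.randbelow(q)
            r2 = ((r + x[i] * y[j]) % q + x[j] * y[i]) % q
            z[i] = (z[i] - r) % q
            z[j] = (z[j] + r2) % q
    return z
```

Counting n initial multiplications and 7 operations for each of the n(n − 1)/2 pairs gives n + 7n(n − 1)/2 = n · (7n − 5)/2.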

Algorithm 17 SecMult
Input: $x_1, \ldots, x_n \in \mathbb{Z}_q$, $y_1, \ldots, y_n \in \mathbb{Z}_q$
Output: $z_1, \ldots, z_n \in \mathbb{Z}_q$ such that $\sum_i z_i = (\sum_i x_i) \cdot (\sum_i y_i) \bmod q$.
1: for $i = 1$ to $n$ do $z_i \leftarrow x_i \cdot y_i \bmod q$
2: for $i = 1$ to $n$ do
3:   for $j = i + 1$ to $n$ do
4:     $r \leftarrow \mathbb{Z}_q$
5:     $r' \leftarrow (r + x_i \cdot y_j \bmod q) + x_j \cdot y_i \bmod q$
6:     $z_j \leftarrow z_j + r' \bmod q$
7:     $z_i \leftarrow z_i - r \bmod q$
8:   end for
9: end for

C.8 Zero testing modulo a prime q via exponentiation

Algorithm 18 SecExpo
Input: $A_1, \ldots, A_n \in \mathbb{Z}_q$ with $\sum_i A_i = x \pmod q$ for prime $q$, an exponent $e$.
Output: B 1 , . . . , B n

Complexity. The complexity of the SecExpo algorithm with e = q − 1 is O(n 2 · log q), assuming that a multiplication modulo q takes unit time. More precisely, the number of operations of SecMult is n · (7n − 5)/2. The number of operations of RefreshMasks is n · (3n − 1)/2. For q = 3329, the algorithm requires 13 squares and 4 multiplies. This means 4 RefreshMasks and 17 SecMult. The total number of operations for SecExpo is therefore n·(131n−89)/2. Eventually, the number of operations for ZeroTestExpoShares is: T ZeroTestExpoShares (n) = n · (131n − 87)/2 and is finally n · (67n − 43) for ZeroTestExpo.
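The operation counts stated above can be cross-checked with a short derivation. The notation $T_{\mathsf{SecMult}}$, $T_{\mathsf{RefreshMasks}}$ and $T_{\mathsf{SecExpo}}$ is introduced here for convenience, and the reading of the last two differences is an interpretation rather than a statement taken from the algorithms themselves.

```latex
\begin{align*}
T_{\mathsf{SecExpo}}(n)
  &= 17 \cdot T_{\mathsf{SecMult}}(n) + 4 \cdot T_{\mathsf{RefreshMasks}}(n) \\
  &= 17 \cdot \frac{n(7n-5)}{2} + 4 \cdot \frac{n(3n-1)}{2} \\
  &= \frac{n(119n-85) + n(12n-4)}{2} = \frac{n(131n-89)}{2}, \\[4pt]
\frac{n(131n-87)}{2} - \frac{n(131n-89)}{2} &= n, \\[4pt]
n(67n-43) - \frac{n(131n-87)}{2}
  &= \frac{n(3n+1)}{2} = \frac{n(3n-1)}{2} + n .
\end{align*}
```

The second line is consistent with ZeroTestExpoShares adding one operation per share on top of SecExpo, and the third with ZeroTestExpo adding one RefreshMasks plus n further operations for the recombination.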

C.9 Proof of Theorem 1
We describe hereafter the construction of the set I ⊂ [1, n] of indices. Initially, I is empty. For every probed input variable x i and for any probed intermediate variable B i at Loop j between Steps 3 and 5, for 1 ≤ i ≤ n, we add index i to I. By construction of the set I, we have |I| ≤ t as required.
We now show that any t probes of Algorithm ZeroTestMult can be perfectly simulated from x |I and b. Since the number of probes t is such that t < n, we deduce that at least one entire loop (Steps 3 to 5) has not been probed. Let j ⋆ be the index of this non-probed loop. For all probed variables B i between Steps 3 and 5 in loop indices j < j ⋆ , we have i ∈ I and the simulation is straightforward from the input shares x |I .
It remains to simulate all probed variables between Steps 3 and 5 in loop indices j ≥ j ⋆ , and all probed variables at Step 7. To this aim, we consider two cases, depending on whether the output is b = 0 or b = 1 (recall that b is given to the simulator).
If b = 1, then we know that $\sum_{i=1}^{n} B_i = 0 \pmod q$ at the end of each for loop. At the end of loop j ⋆ , since LinearRefreshMasks has not been probed, we can perfectly simulate all variables B i , by generating random B i 's for 1 ≤ i ≤ n such that $\sum_{i=1}^{n} B_i = 0 \pmod q$. Similarly, if b = 0, we use the fact that u j ⋆ has not been probed and acts as a multiplicative one-time pad in Z * q . This implies that the value encoded by the B i 's is randomly distributed in Z * q . We can therefore perfectly simulate all shares B i for 1 ≤ i ≤ n at the end of loop j ⋆ by generating random B i 's under the condition $\sum_{i=1}^{n} B_i \neq 0 \pmod q$. In both cases, one can propagate the simulation until the end of the for loop, that is until j = n, and from the knowledge of the B i shares at the end of the for loop, one can compute all probed intermediate variables at Step 7 as in the real algorithm. We therefore conclude that the ZeroTestMult algorithm is (n − 1)-NIo, when the output b is given to the simulator.

C.10 Proof of Theorem 2
The SecExpo algorithm is (n − 1)-SNI since it is the composition of several iterations of the SecMult algorithm which is (n − 1)-SNI, with some RefreshMasks operations which are also (n − 1)-SNI. The ZeroTestExpoShares algorithm is (n − 1)-SNI since it essentially consists of the SecExpo algorithm which is (n − 1)-SNI, where the output variables are simply modified with some known constants. This implies that the ZeroTestExpo algorithm is (n − 1)-NIo.

D.1 Polynomial comparison of Boolean masked coefficients
We are given as input a set of ℓ · n shares (x (j) ) i ∈ {0, 1} k for 1 ≤ j ≤ ℓ and 1 ≤ i ≤ n, corresponding to ℓ coefficients $x^{(j)} = x^{(j)}_1 \oplus \cdots \oplus x^{(j)}_n$, and we must output a single bit b such that b = 1 if x (j) = 0 for all 1 ≤ j ≤ ℓ, and b = 0 otherwise. The simplest approach is to perform a Boolean zero-test of each x (j) as in Section C.3, keeping each resulting bit b (j) in Boolean n-shared form, and then to perform a sequence of SecAnds between the bits b (j) , and to eventually recombine the shares into a bit b. The complexity of this approach is then O(ℓ · n 2 · log k). A slightly better approach is to high-order compute:
$y = x^{(1)} \vee \cdots \vee x^{(\ell)}$
Then y = 0 iff x (j) = 0 for all 1 ≤ j ≤ ℓ, so we eventually perform a single zero-test of y. In this approach we take advantage of computing the SecAnds over k bits instead of a single bit. The complexity is then O(ℓ · n 2 + n 2 · log k). We obtain the pseudo-code below.
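The second approach can be sketched in Python as follows. This is an illustrative sketch under assumptions and not the paper's PolyZeroTestBool pseudo-code: the Or of the coefficients is computed as the complement of an And of complements, each complement is obtained by flipping one share, and the final zero-test folds the bits with SecAnd on shifted and refreshed shares. All function names are hypothetical.

```python
import secrets
from functools import reduce
from operator import xor

def share(v, n, k):
    s = [secrets.randbits(k) for _ in range(n - 1)]
    return s + [reduce(xor, s, v)]

def sec_and(x, y, k):
    n = len(x)
    z = [x[i] & y[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(k)
            r2 = (r ^ (x[i] & y[j])) ^ (x[j] & y[i])
            z[i] ^= r
            z[j] ^= r2
    return z

def refresh(a, k):
    b = list(a)
    for i in range(len(b)):
        for j in range(i + 1, len(b)):
            r = secrets.randbits(k)
            b[i] ^= r
            b[j] ^= r
    return b

def poly_zero_test_bool(coeffs, k):
    """coeffs[j] is the list of n Boolean shares of the k-bit coefficient j."""
    n = len(coeffs[0])
    mask = (1 << k) - 1
    acc = [mask] + [0] * (n - 1)            # trivial encoding of all-ones
    for x in coeffs:                         # one k-bit SecAnd per coefficient
        not_x = [x[0] ^ mask] + list(x[1:])
        acc = sec_and(acc, not_x, k)         # acc encodes NOT(x1 OR ... OR xj)
    m = (k - 1).bit_length()
    width = 1 << m
    acc[0] |= ((1 << width) - 1) ^ mask      # pad with 1 bits up to 2^m bits
    for i in range(m):                       # log-depth fold of the bits
        shifted = refresh([s >> (1 << i) for s in acc], width)
        acc = sec_and(acc, shifted, width)
    bits = refresh([s & 1 for s in acc], 1)
    return reduce(xor, bits, 0)              # only the bit b is recombined
```

The loop over the coefficients costs ℓ calls to SecAnd on k-bit words, and the fold costs about log 2 k further calls, in line with the O(ℓ · n 2 + n 2 · log k) estimate.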

D.2 Polynomial comparison modulo 2 k via arithmetic to Boolean conversion
We are given as input a set of ℓ · n shares (x (j) ) i for 1 ≤ j ≤ ℓ and 1 ≤ i ≤ n, corresponding to ℓ coefficients $x^{(j)} = x^{(j)}_1 + \cdots + x^{(j)}_n \bmod q$, and we must output a single bit b such that b = 1 if x (j) = 0 for all 1 ≤ j ≤ ℓ, and b = 0 otherwise. For this we simply perform an arithmetic to Boolean conversion of each coefficient x (j) separately and then apply the previous PolyZeroTestBool algorithm. The complexity of each arithmetic to Boolean conversion is O(n 2 · k) for k = ⌈log 2 q⌉. Therefore the total complexity is O(ℓ · n 2 · k). We provide the pseudocode of the corresponding algorithm PolyZeroTestAB below.

Algorithm 22 PolyZeroTestAB
Input: q ∈ Z, k ∈ Z with q ≤ 2 k , and (x

The number of operations is $T_{\mathsf{PolyZeroTestAB}}(k, n) = \ell \cdot T_{AB}(k, n) + T_{\mathsf{PolyZeroTestBool}}(k, n)$.

The following theorem shows that the adversary does not learn more than the output bit b of the comparison. The proof is straightforward and therefore omitted.

D.3 Polynomial comparison modulo prime q: reduction step
When working modulo a prime q, we can apply the technique from [BDH + 21] that efficiently reduces the zero-testing of ℓ coefficients to the zero-testing of κ ≪ ℓ coefficients, with κ = ⌈λ/ log 2 q⌉, where λ is the security parameter. Given as input ℓ coefficients x (j) ∈ Z q with arithmetic shares x (j) i , the technique consists in computing κ linear combinations:
$y^{(k)} = \sum_{j=1}^{\ell} a_{kj} \cdot x^{(j)} \bmod q$ (9)
for 1 ≤ k ≤ κ, with randomly distributed coefficients a kj ∈ Z q . The above equation is actually high-order computed using the arithmetic shares x (j) i of each x (j) , and we obtain the arithmetic shares y (k) i of each coefficient y (k) . We obtain the pseudo-code below.
We stress that after this reduction step we cannot zero-test the coefficients y (k) separately. Otherwise, since the coefficients a kj in (9) are computed in the clear, knowing that y (k) = 0 for some k would leak an equation over the coefficients x (j) , which would leak information about the x (j) with fewer than n probes. Instead, the remaining κ coefficients y (k) must be zero-tested all at once. This reduction technique is quite efficient because the random coefficients a kj in (9) are non-masked, which implies that each multiplication a kj · x (j) can be computed in time O(n) for n shares, instead of O(n 2 ) for a fully masked multiplication. The total complexity of this first step is therefore O(ℓ · κ · n), with a number of operations: $T_{\mathsf{PolyZeroTestRed}}(\kappa, \ell, n) = \kappa \cdot \ell \cdot (2n + 1)$.
Theorem 15 ([BDH + 21]). The PolyZeroTestRed algorithm is (n − 1)-NI.
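The reduction step can be sketched in Python as follows. This is a hedged illustration of the share-wise linear combinations with public coefficients; the function name `poly_zero_test_red` and the optional argument `a`, which lets a caller supply the coefficients for testing, are assumptions and not part of the paper's pseudo-code.

```python
import secrets

def poly_zero_test_red(x, q, kappa, a=None):
    """Reduce l shared coefficients to kappa shared coefficients modulo q.

    x[j][i] is the i-th arithmetic share of coefficient j. The coefficients
    a[k][j] are public (non-masked), so each product acts share by share.
    """
    ell, n = len(x), len(x[0])
    if a is None:
        a = [[secrets.randbelow(q) for _ in range(ell)] for _ in range(kappa)]
    y = [[0] * n for _ in range(kappa)]
    for k in range(kappa):
        for j in range(ell):
            for i in range(n):               # n multiplications, n additions
                y[k][i] = (y[k][i] + a[k][j] * x[j][i]) % q
    return y
```

Each pair (k, j) costs one random coefficient, n multiplications and n additions, which matches the count κ · ℓ · (2n + 1). The resulting κ shared values must then be zero-tested all at once, as stressed above.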
for i = 1 to n do z i ← 0

D.5 Proof of Theorem 3 (soundness of PolyZeroTestMult)
We denote by PolyZeroTestMultLoop, one loop iteration on k of the PolyZeroTestMult algorithm, namely, going from line 4 to 11. We start by showing that PolyZeroTestMultLoop computes the correct answer b k , except with probability at most 1/q. Indeed, in PolyZeroTestMultLoop, one securely computes the value $z = \sum_{j=1}^{\ell} a^{(j)} \cdot y^{(j)} \pmod q$ where the random values a (j) are uniformly distributed in Z q . Thus, if z ≠ 0, then at least one coefficient y (j) is not zero and the output b k = 0 is always correct.
However, if z = 0, two cases arise: either all coefficients y (j) are null in which case the algorithm outputs b k = 1 which is correct, or at least one coefficient y (j) is such that y (j) ≠ 0 but with $\sum_{j=1}^{\ell} a^{(j)} \cdot y^{(j)} = 0 \pmod q$ and the output b k = 1 in this case is incorrect. Since the a (j) values are uniformly distributed in Z q , the result of the linear combination of the y (j) ≠ 0 with the values a (j) is also uniform in Z q . Therefore the probability that $\sum_{j=1}^{\ell} a^{(j)} \cdot y^{(j)} = 0 \pmod q$ is 1/q for each iteration of PolyZeroTestMultLoop.
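The 1/q error probability of one loop iteration can be checked by exhaustive enumeration for a small prime. The sketch below is an illustration on toy parameters, with a hypothetical function name; it counts the coefficient vectors for which the linear combination vanishes.

```python
from itertools import product

def count_vanishing(y, q):
    """Count the vectors a in Z_q^l with sum_j a_j * y_j = 0 (mod q)."""
    return sum(
        1
        for a in product(range(q), repeat=len(y))
        if sum(ai * yi for ai, yi in zip(a, y)) % q == 0
    )
```

For a prime q and a vector y with at least one non-zero entry, exactly q ℓ−1 of the q ℓ vectors a give a zero sum, that is a fraction 1/q. When all y (j) are zero, every vector a gives a zero sum and the output is always correct.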

D.6 Proof of Theorem 4 (security of PolyZeroTestMult)
As before, we denote by PolyZeroTestMultLoop, one loop iteration on k of the PolyZeroTestMult algorithm (line 4 to 11). We write $y^{(j)} = \sum_{i=1}^{n} y^{(j)}_i \bmod q$. We distinguish two cases: either y (j) = 0 for all 1 ≤ j ≤ ℓ, or y (j) ≠ 0 for some j. We show that the simulator can perform a perfect simulation in both cases. Moreover, by assumption the simulator eventually receives the bit b. This means that the simulator can distinguish the two cases, except with error probability at most q −κ . Therefore the error probability of the simulator will be at most q −κ .
y (j) = 0 for all 1 ≤ j ≤ ℓ. This is the easy case. Namely in that case, we know that b k = 1 for all k. The computation of the shares z i at Line 8 is (n − 1)-SNI. Knowing b k , the algorithm ZeroTestMult at Step 10 is (n − 1)-NI from Theorem 1. Therefore the global PolyZeroTestMultLoop algorithm remains (n − 1)-NI.
y (j) ≠ 0 for some 1 ≤ j ≤ ℓ. We consider a sequence of games.
Game 0 : we generate all variables as in the algorithm. We assume that we know all input shares y (j) i . We can therefore perform a perfect simulation of all probes. Moreover, we have that Pr[b k = 1] = 1/q for all 1 ≤ k ≤ κ, and the variables b k are independently distributed.
Game 1 : we modify the way the variables are generated. Instead of generating all variables a (j) i uniformly and independently, we first generate the bits b k independently with Pr[b k = 1] = 1/q. Then for each 1 ≤ k ≤ κ, if b k = 1 then we generate the shares a (j) i such that $\sum_{j=1}^{\ell} a^{(j)} y^{(j)} = 0 \pmod q$, where $a^{(j)} = \sum_{i=1}^{n} a^{(j)}_i \bmod q$. Otherwise, we generate the shares a (j) i such that $\sum_{j=1}^{\ell} a^{(j)} y^{(j)} \neq 0 \pmod q$. The distribution of the variables is the same as in the previous game. Therefore, we can still perform a perfect simulation of all probed variables.
Game 2 : we show that we can still perform a perfect simulation as in Game 1 , but only with the input shares y (j) |I for a subset |I| ≤ t. This will prove that the algorithm is (n − 1)-NI. Firstly, from the (n − 1)-SNI property of SecMult and the (n − 1)-NI property of ZeroTestMult knowing b k , the simulation of all probes can be performed from the knowledge of a subset y (j) |I of the input shares for |I| ≤ t, and a subset a (j) |J of the shares of the values a (j) , for |J| ≤ t ≤ n − 1. Secondly, the constraints on $\sum_{j=1}^{\ell} a^{(j)} y^{(j)}$ from Game 2 can be satisfied by generating all shares a (j) i for i ≠ i ⋆ uniformly at random, and by fixing the remaining shares a (j) i ⋆ accordingly, for an index i ⋆ ∉ J. Since a (j) i ⋆ is not needed for the simulation, we can perform a perfect simulation of all probes from y (j) |I . This concludes the proof.
Note that an n-sharing of the coefficients a (j) is required for the simulation. If the coefficients a (j) were computed in the clear, one could not satisfy the constraints on the linear sums, without knowing the coefficients y (j) .
Generating the list of candidates. From Lemma 1, to generate the list of candidates Compress −1 q,d (c), it suffices to consider the set [a, b] q with a = y − B q,d and b = y + B q,d and to test whether the two elements at the border belong to the set, that is we check whether Compress q,d (a) = c and Compress q,d (b) = c. We provide the pseudocode in Algorithm 26 below.

Algorithm 26 CompressInv
Input: $c \in \mathbb{Z}_{2^k}$
Output: $a, b \in \mathbb{Z}$ such that $\mathsf{Compress}^{-1}_{q,d}(c) = [a, b]_q$.
1: $B_{q,d} \leftarrow \lceil q / 2^{d+1} \rfloor$
2: $y \leftarrow \mathsf{Decompress}_{q,d}(c)$
3: $a \leftarrow y - B_{q,d} \bmod q$, $b \leftarrow y + B_{q,d} \bmod q$
4: if $\mathsf{Compress}_{q,d}(a) \neq c$ then $a \leftarrow a + 1 \bmod q$
5: if $\mathsf{Compress}_{q,d}(b) \neq c$ then $b \leftarrow b - 1 \bmod q$
6: return $a, b$

SecAnd) of the most significant bit of the two results (using the Boolean shares). The number of operations is therefore: $T_{\mathsf{range}}(k, n) = 2 \cdot T_{AB}(k + 1, n) + T_{\mathsf{SecAnd}}(n)$
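The candidate generation of Algorithm 26 (CompressInv) can be illustrated with the Python sketch below. It assumes the usual Kyber definitions of Compress and Decompress with rounding to the nearest integer, and takes B q,d as q/2 d+1 rounded to the nearest integer; these definitions and all function names are assumptions made for the sketch, to be checked against Lemma 1 and the scheme specification.

```python
def compress(x, q, d):
    """Round (2^d / q) * x to the nearest integer, modulo 2^d (q odd)."""
    return (((x << d) + q // 2) // q) % (1 << d)

def decompress(c, q, d):
    """Round (q / 2^d) * c to the nearest integer."""
    return (c * q + (1 << (d - 1))) >> d

def compress_inv(c, q, d):
    """Return (a, b) such that the preimage of c under compress is [a, b]_q."""
    B = (q + (1 << d)) >> (d + 1)        # q / 2^(d+1), rounded
    y = decompress(c, q, d)
    a, b = (y - B) % q, (y + B) % q
    if compress(a, q, d) != c:           # trim the lower border if needed
        a = (a + 1) % q
    if compress(b, q, d) != c:           # trim the upper border if needed
        b = (b - 1) % q
    return a, b

def interval(a, b, q):
    """Elements of the cyclic interval [a, b]_q, wrapping around modulo q."""
    return {(a + t) % q for t in range((b - a) % q + 1)}
```

Only the two border elements need to be tested, because every other element of the interval [y − B q,d , y + B q,d ] q is guaranteed to compress to c.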

Algorithm 28 RangeTestShares
Input: $x_1, \ldots, x_n \in \mathbb{Z}_q$ for prime $q$ with $\sum_i x_i = x \pmod q$, $k = \lfloor \log_2 q \rfloor$, bounds a and b s.t.

F Ring-LWE IND-CPA encryption
We provide an overview of ring-LWE encryption [LPR10]. For N ∈ Z and q ∈ Z, let R and R q denote the rings Z[X]/(X N + 1) and Z q [X]/(X N + 1) respectively. Let a ∈ R q be a public random polynomial. Let χ be a distribution outputting "small" elements in R, and let s, e ← χ.

G.1 The Saber Key Encapsulation Mechanism (KEM)
Saber [BMD + 21] is based on the hardness of the module learning-with-rounding (M-LWR) problem. The difference with Kyber is that instead of explicitly adding error terms e, e 1 , e 2 from a "small" distribution, the errors are deterministically added by applying a rounding function mapping Z q to Z p with p < q. For Saber, both p and q are powers of two; therefore the rounding function is a shift extracting the log 2 (p) most significant bits of its input.
The Saber submission provides three parameter sets LightSaber, Saber and FireSaber with claimed security levels equivalent to AES-128, AES-192 and AES-256 respectively; see Table 14. We recall the pseudocode below. The constants h 1 , h 2 and h are needed to center the errors introduced by rounding around 0. We write q = 2 ϵq and p = 2 ϵp .