High-order Table-based Conversion Algorithms and Masking Lattice-based Encryption

Masking is the main countermeasure against side-channel attacks on embedded devices. For cryptographic algorithms that combine Boolean and arithmetic masking, one must therefore convert between the two types of masking, without leaking additional information to the attacker. In this paper we describe a new high-order conversion algorithm between Boolean and arithmetic masking, based on table recomputation, and provably secure in the ISW probing model. We show that our technique is particularly efficient for masking structured LWE encryption schemes such as Kyber and Saber. In particular, for Kyber IND-CPA decryption, we obtain an order of magnitude improvement compared to existing techniques.

considered separately have a uniform distribution [CJRR99]. However it is still possible to perform a second-order attack combining information leakage about the two shares x and r; this usually requires a much larger amount of side-channel traces, see for example [OMHT06].
To prevent such higher-order attacks, a generalization of the masking countermeasure consists in splitting any variable $x$ into $n$ shares with $x = x_1 \oplus \cdots \oplus x_n$. The shares $x_i$ must then be processed separately in order to avoid any leakage of information about the original variable $x$. To formally argue about the security provided by the masking countermeasure, Ishai, Sahai and Wagner introduced in [ISW03] the probing model, by considering an adversary who can probe at most $t$ wires in a circuit. Using the masking countermeasure with $n = 2t + 1$ shares, they showed how to transform any Boolean circuit $C$ into a circuit of size $O(|C| \cdot t^2)$ that is perfectly secure against such an adversary. Moreover, in [DDF14], it was shown that security in the probing model implies security against noisy leakage, under the assumption that every variable leaks independently.
Security in the probing model is usually proven by simulation: one must show that any set of $t$ probes can be perfectly simulated without knowing the original variables of $C$. To facilitate the writing of security proofs, Barthe et al. introduced the notions of (Strong) Non-Interference (NI/SNI), to allow easy composition of gadgets [BBD+16]. The authors proved the $t$-SNI property for the original ISW multiplication gadget. They also showed that with some additional mask refreshing, a circuit $C$ can be made secure against $t$ probes with $n = t + 1$ shares only, instead of $n = 2t + 1$ shares in [ISW03].
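As an illustration of the $n$-share Boolean masking recalled above, the following minimal Python sketch splits a $k$-bit value into $n$ shares and recombines them. The function names `boolean_share` and `boolean_unshare` are hypothetical helpers introduced here for illustration only; recombining the shares is of course something a masked implementation never does on the device.

```python
import secrets
from functools import reduce
from operator import xor

def boolean_share(x, n, k):
    """Split a k-bit value x into n shares with x = x_1 ^ ... ^ x_n."""
    # n - 1 shares are uniformly random, the last one fixes the xor to x.
    shares = [secrets.randbelow(1 << k) for _ in range(n - 1)]
    shares.append(reduce(xor, shares, x))
    return shares

def boolean_unshare(shares):
    """Recombine the shares (for testing only: this unmasks the value)."""
    return reduce(xor, shares, 0)

shares = boolean_share(0xA5, n=5, k=8)
assert boolean_unshare(shares) == 0xA5
```

Any subset of $n - 1$ of these shares is uniformly distributed, which is the property that an adversary with at most $t = n - 1$ probes cannot exploit.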
While block-ciphers such as AES are typically protected using Boolean masking as above [RP10], lattice-based schemes often require a combination of arithmetic and Boolean masking. This implies that conversions between arithmetic and Boolean masking play an essential role in masked implementations of lattice-based schemes. Such conversions were previously used in block-ciphers and hash functions combining Boolean and arithmetic operations (such as SHA-1). However, they were based solely on power-of-two moduli, while many lattice-based schemes use a prime modulus (as in Kyber). The conversions must therefore be adapted in the context of post-quantum cryptography.
First-order conversion algorithms. The first conversion algorithms were proposed by Goubin in [Gou01], with security against first-order attacks. The Boolean to arithmetic conversion is efficient and has an optimal complexity $O(1)$. The conversion from arithmetic to Boolean masking is less efficient, as its complexity is $O(k)$ for conversion from arithmetic masking modulo $2^k$. This was later improved to $O(\log k)$ in [CGTV15]; however in practice for $k = 32$ the number of operations was similar.
A table-based conversion from arithmetic to Boolean masking was described in [CT03], for first-order security only. For a small value of $k$, the conversion can be done by a simple table look-up, using a pre-computed table. This is similar to the classical first-order randomized SBox table countermeasure [CJRR99]. More precisely, the algorithm uses a randomized pre-computed table $T : \mathbb{Z}_{2^k} \to \{0,1\}^k$ which is initialized as follows. First, one generates a random mask $r \leftarrow \mathbb{Z}_{2^k}$. As a second step, one computes $T[v] = (v + r) \oplus r$ for all $v \in \mathbb{Z}_{2^k}$, where the addition is modulo $2^k$. Then, given an arithmetically masked value $A = x - r \pmod{2^k}$, one obtains a Boolean masking $x'$ of $x$ by simply reading the table $T$ at index $A$: this gives $x' = T[A] = (A + r) \oplus r = x \oplus r$ as required. The same randomized table can be used multiple times; therefore once the table has been initialized for all possible values in $\mathbb{Z}_{2^k}$, each conversion is a simple table look-up.
The authors also showed how to extend the technique to convert variables of $k = \delta \cdot \ell$ bits, by propagating the carry by blocks of $\ell$ bits. However there was a flaw in their algorithm: they computed the carry table modulo $2^\ell$ only, instead of modulo $2^{k-\ell}$; therefore the algorithm is incorrect for $\delta > 2$. This mistake was identified and corrected by Debraize in [Deb12]. In [Deb12], the author described multiple first-order conversion algorithms.

The randomized table countermeasure. For protecting the computation of an SBox against side-channel attacks, the classical first-order randomized table countermeasure from [CJRR99] was extended to high-order in [Cor14]. The high-order table recomputation countermeasure works as follows. Given a $k$-bit SBox $S$, one generates a table $T$ with $2^k$ rows, where each row consists of $n$ shares. Given as input $n$ shares $x_i$ such that $x = x_1 \oplus \cdots \oplus x_n$, the goal is to compute an $n$-sharing of $y = S(x)$, without leaking information about $x$. The rows of $T$ are initialized with $T(u) = (S(u), 0, \ldots, 0)$ for all $u \in \{0,1\}^k$, and one maintains the invariant that after Step $i$ the rows of $T$ are $n$-encodings of the rows of $S$, but shifted by $x_1 \oplus \cdots \oplus x_i$. For this one incrementally shifts the rows of the table by the successive input shares $x_1, \ldots, x_{n-1}$; the $n$-encodings on each row are refreshed between every shift. In the end, the rows of the table have been shifted by $x_1 \oplus \cdots \oplus x_{n-1}$, so it suffices to read the table $T$ at the row $u = x_n$ to get the $n$ output shares $y_i$ corresponding to $y = S(x)$.
The countermeasure was proven secure against $t$ probes in the ISW model with $n = 2t + 1$ shares, and later proven SNI in [CRZ18], so that it can work with $n = t + 1$ shares only. The authors also described a variant where the number of output shares in the randomized table $T$ is progressively increased from 1 to $n$, which saves a factor 2 in running time. For a $k$-bit SBox and $n$ shares, the countermeasure has complexity $O(2^k \cdot n^2)$.

Table 1: Complexities of Boolean vs arithmetic conversions in both directions, for first-order attacks, and for high-order attacks with $n$ shares. For all algorithms the complexity is for $k$-bit registers, except for Algorithm 12 which uses $2^k$-bit registers. We indicate the Mod $q$ property when the arithmetic masking can be modulo any $q$, not only modulo $2^k$.
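The first-order table look-up recalled above can be sketched as follows. This is a minimal illustration under the definitions of the paragraph ($T[v] = (v + r) \oplus r$ and $A = x - r \bmod 2^k$); the function name `build_table` and the parameter choice $k = 4$ are assumptions made for the example.

```python
import secrets

def build_table(k, r):
    """Randomized table T[v] = ((v + r) mod 2^k) xor r, for all v in Z_{2^k}."""
    mask = (1 << k) - 1
    return [((v + r) & mask) ^ r for v in range(1 << k)]

k = 4
r = secrets.randbelow(1 << k)      # random mask, shared by both maskings
T = build_table(k, r)              # precomputed once, reused for every conversion

x = 11                             # secret value (only used to build the example)
A = (x - r) % (1 << k)             # arithmetic masking: x = A + r mod 2^k
x_prime = T[A]                     # Boolean masking:    x = x_prime xor r
assert x_prime ^ r == x
```

The table has $2^k$ entries, which is why this direct look-up is only practical for small $k$.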

Direction | Mod q | First-order complexity | High-order complexity | Memory complexity

First contribution: high-order table-based conversion. Our first contribution is to extend the table-based conversion algorithm between Boolean and arithmetic masking of [CT03] from first-order to any order. For this we extend the high-order table recomputation countermeasure from [Cor14] recalled above. Namely we observe that in [Cor14], the incremental shifting of the rows of the table $T$ can be performed according to any additive group $G$, not only for the xor operation in $\{0,1\}^k$. For example we can work modulo $2^k$ as input, which automatically gives a high-order conversion from arithmetic to Boolean masking. Similarly, the $n$-encoding of the rows of $T$ as output can be according to any group law, not only for the xor in $\{0,1\}^k$. This implies that we can easily convert from Boolean to arithmetic masking modulo any integer $q$, which is useful in the context of lattice-based cryptography (see below). More generally, our extended table recomputation countermeasure allows computing any function $f : G \to H$, for any group $G$ as input and any group $H$ as output. Given as input an $n$-sharing of $x = x_1 + \cdots + x_n \in G$, we can compute $n$ output shares $y_i \in H$ such that $y_1 + \cdots + y_n = f(x_1 + \cdots + x_n)$, while being secure in the ISW probing model against $t = n - 1$ probes. By selecting the right groups $G$ and $H$, we can therefore obtain high-order secure conversion algorithms between Boolean and arithmetic masking. To convert from Boolean to arithmetic masking modulo $2^k$, we take $G = \{0,1\}^k$ and $H = \mathbb{Z}_{2^k}$ and we obtain $y_1 + \cdots + y_n = x_1 \oplus \cdots \oplus x_n \pmod{2^k}$ as required. Similarly for arithmetic to Boolean conversion we take $G = \mathbb{Z}_{2^k}$ and $H = \{0,1\}^k$, and we obtain $y_1 \oplus \cdots \oplus y_n = x_1 + \cdots + x_n \pmod{2^k}$ as required.
The main advantage of the table-based approach for conversions is its flexibility, as we can choose any groups $G$ and $H$ and any function $f : G \to H$. However the running time complexity is $O(n^2 \cdot |G|)$. This implies that for $k$-bit Boolean or arithmetic masking, the generic complexity is $O(n^2 \cdot 2^k)$. Fortunately for specific groups $G$ as input one can do much better. When converting from Boolean masking with $G = \{0,1\}^k$, we describe a simple optimization with complexity $O(n^2 \cdot k)$ only, as in [CGV14]. For the other direction, we first describe a technique to compute a shift by $\ell$ bits on some arithmetically masked value, with complexity $O(n^2 \cdot \ell)$. This is independently useful for the masking of Saber, which requires the computation of logical shifts. Our technique is based on an extension of the carry propagation technique from [CT03] to high-order security. From the arithmetic shift, it is then easy to obtain a conversion from arithmetic masking modulo $2^k$ to Boolean masking, again with complexity $O(n^2 \cdot k)$ as in [CGV14]. See Table 1 for a summary of our conversion algorithms.
Moreover we describe two optimizations for the specific case of 1-bit Boolean masking, which is useful in the context of lattice-based cryptography. For converting from 1-bit Boolean masking to arithmetic masking modulo any $q$, our complexity becomes $O(n^2)$ only, as in [SPOG19], instead of $O(n^2 \cdot \log q)$ with [BBE+18]. In the other direction, we show how to efficiently compute a threshold function th from arithmetic masking modulo $2^k$, whose result is a 1-bit Boolean masking. This corresponds to the decryption function in lattice-based cryptosystems. In that case, our optimization consists in putting each column of the table in a single register; the resulting complexity is $O(n^2)$ only, assuming that we have access to $2^k$-bit registers (see Algorithm 12 in Table 1). In practice, for both optimizations we obtain at least an order of magnitude improvement compared to the techniques in [CGV14, BBE+18].
Second contribution: high-order masking of lattice-based encryption schemes. Our second contribution is to apply our table-based conversion algorithms to the masking of lattice-based encryption schemes such as Kyber and Saber. Recall that for the IND-CCA decryption, one must perform the following operations according to the Fujisaki-Okamoto (FO) transform: the IND-CPA decryption of the ciphertext $c$ into a message $m$ (Step 1), the re-encryption of $m$ into a ciphertext $c'$, including the binomial sampling (Step 2), and the polynomial comparison between $c$ and $c'$ (Step 3).

We first consider the IND-CPA decryption of the ciphertext $c$ (Step 1). For ring-LWE encryption the ciphertext $c = (c_1, c_2)$ is decrypted with the private-key $s$ using $m = \mathrm{th}(c_1 - s \cdot c_2)$, where th is the threshold function $\mathrm{th} : \mathbb{Z}_q \to \{0,1\}$ with $\mathrm{th}(x) = 1$ if $x \in (q/4, 3q/4)$ and $\mathrm{th}(x) = 0$ otherwise. The threshold function is actually applied independently on each coefficient of the polynomial $u = c_1 - s \cdot c_2$ modulo $q$. When the private-key $s$ is arithmetically masked modulo $q$ with $n$ shares, we obtain $n$ shares for $u = u_1 + \cdots + u_n \pmod{q}$, and we must therefore convert from an arithmetically masked $u$ modulo $q$ into a 1-bit Boolean masked $m = m_1 \oplus \cdots \oplus m_n = \mathrm{th}(u)$. For this one could use our generic table-based approach with the function $f = \mathrm{th}$ and $f : \mathbb{Z}_q \to \{0,1\}$. However the complexity would be $O(n^2 \cdot q)$, which is prohibitive for large $q$. Therefore we describe an optimization in which we first perform a modulus switching from masking modulo $q$ to masking modulo $2^k$ (for a small $k$), while maintaining a negligible probability of decryption error, as required to achieve CCA-security [DNR04]. We can then convert from arithmetic masking modulo $2^k$ into 1-bit Boolean masking, which recovers the Boolean masked message $m$. This optimization has complexity $O(n^2 \cdot \log n)$ (see Algorithm 13 in Table 1), instead of $O(n^2 \cdot \log q)$ with [BBE+18]. In practice we obtain an order of magnitude improvement in the IND-CPA decryption of Kyber.
We also consider the masking of the re-encryption of $m$ into a ciphertext $c'$, and the masking of the binomial sampling (Step 2). To encrypt a Boolean masked message $m \in \{0,1\}$, we can use our generic table-based Boolean to arithmetic modulo $q$ conversion algorithm. In that case the complexity is $O(n^2)$ as in [SPOG19]. The same holds for the masking of the binomial sampling, which is easily computed as the sum of independent 1-bit Boolean to arithmetic modulo $q$ conversions, as in [SPOG19]. In practice we obtain a similar level of efficiency as in [SPOG19], and an order of magnitude improvement compared to [BBE+18].
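To make the role of the threshold function concrete, here is a small unmasked Python sketch of th, assuming the Kyber modulus $q = 3329$ for the example. It also illustrates why a dedicated conversion is needed: th is not linear, so applying it share by share does not give a sharing of $\mathrm{th}(u)$. The helper name `th` follows the text; everything else is illustrative.

```python
Q = 3329  # example modulus (Kyber); any q works for the definition below

def th(x, q=Q):
    """Threshold function: 1 if x mod q lies in the open interval (q/4, 3q/4), else 0."""
    x %= q
    return 1 if q < 4 * x < 3 * q else 0

# A coefficient close to q/2 decodes to 1, a coefficient close to 0 decodes to 0.
assert th(1665) == 1
assert th(10) == 0 and th(Q - 10) == 0

# th is not compatible with share-wise evaluation: with u = u1 + u2 mod q,
# th(u1) xor th(u2) differs from th(u) in general.
u1, u2 = 1000, 1000
assert th(u1) ^ th(u2) != th((u1 + u2) % Q)
```

This is why the masked decryption has to go through an arithmetic to Boolean style conversion rather than a share-wise evaluation of th.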
In summary, our table-based approach for conversion between Boolean and arithmetic masking provides significant efficiency improvement in the context of lattice-based cryptography, especially for IND-CPA decryption (Step 1), while being relatively easy to implement. We provide a detailed comparison with existing conversion algorithms. We leave the high-order masking of the polynomial comparison (Step 3) for future work.

Related work on masking ring-LWE encryption scheme
First-order masking. In [RRVV15], the authors described a first-order masking of IND-CPA decryption with $n = 2$ shares. They describe a relatively complex masked decoder to compute $m_1 \oplus m_2 = \mathrm{th}(u_1 + u_2)$. The decoder only works for half of the inputs, so it must be restarted up to 16 times with a certain shift $\delta \in [0, q - 1]$.
In [OSPG18], the authors describe a first-order masking of the full IND-CCA decryption. The IND-CPA decryption part is based on first converting the arithmetic masking modulo $q$ into an arithmetic sharing modulo $2^k$, and then converting from arithmetic to Boolean masking. The re-encryption and binomial sampling are also masked to first-order. The polynomial comparison between $x$ and $y = y_1 + y_2 \pmod{q}$ is done by checking that $H(x - y_1) = H(y_2)$. While efficient, the techniques seem relatively difficult to generalize to high-order.
In [BDK+20], the authors describe a first-order masked implementation of Saber, with only a 2.5x overhead factor. They introduce an optimized arithmetic to arithmetic conversion algorithm (A2A) for performing logical shifts. Their algorithm is based on securely propagating the carry with a pre-computed table, by adapting the techniques from [CT03, Deb12]. Their A2A conversion only applies to first-order masking. In this paper, we describe in Section 5 a generalization to high-order masking of the A2A computation of the logical shift, based on table recomputation.
In [FBR+21], the authors describe a first-order masking of Kyber and Saber, for both software and hardware implementations. For the masking of the Compress function in Kyber, they describe a technique based on modulus switching. They show that the error induced by the modulus switching can be eliminated by using more precision and then truncating, using an arithmetic to Boolean conversion. We use a similar modulus switching technique in Section 8.2, but extended to high-order masking.
High-order masking. In [SPOG19], the authors describe a very interesting technique to convert from 1-bit Boolean masking to arithmetic masking modulo $q$, with complexity $O(n^2)$, for security against $t = n - 1$ probes. As an application, they obtain a high-order $k$-bit Boolean to arithmetic conversion algorithm with complexity $O(k \cdot n^2)$, and also a high-order masked binomial sampling algorithm with complexity $O(k \cdot n^2)$, where $k$ is the length of the bit-vectors. The 1-bit Boolean to arithmetic masking modulo $q$ is based on the following equation for $x, y \in \{0,1\}$:
$$x \oplus y = x + y - 2 \cdot x \cdot y,$$
which was already considered in [OSPG18] for first-order conversion. From the above equation, if we already have an arithmetic sharing of both $x$ and $y$, we can obtain an arithmetic sharing of $x \oplus y$. Such arithmetic sharing can be modulo any integer $q$, as long as it encodes an element in $\{0,1\}$. Namely the product $x \cdot y$ can be computed with $n$ arithmetic shares modulo $q$, as in the And gate in [ISW03], with complexity $O(n^2)$. Using a recursive approach similar to [CGV14], one obtains a 1-bit Boolean to arithmetic masking modulo $q$ conversion, with complexity $O(n^2)$. The authors of [SPOG19] actually describe an iterative approach, still with complexity $O(n^2)$.
In [BDH+21], the authors describe an attack against the first-order masked ciphertext comparison in [OSPG18]. The attack is based on the fact that the ciphertext comparison in [OSPG18] is performed iteratively on different parts of the ciphertext, and the output of the first comparison leaks sensitive information to the attacker. The attack does not apply against the ciphertext comparison used for the protection of Saber in [BDK+20], which implements only a single check. The authors of [BDH+21] also describe a similar attack against the high-order polynomial comparison from [BPO+20], which proceeds in sets of coefficients and where the pass/fail bit is unmasked for every set. They also describe a clever variant attack that does not use any side-channel information. To prevent these attacks, the polynomial comparison should be an atomic operation that does not reveal partial comparison results on a subset of the coefficients.
In [BGR+21], the authors describe the first completely masked implementation of Kyber, secure against first-order and higher-order attacks. For the IND-CPA decryption, the authors consider the $\mathsf{Compress}^s_q(x)$ function that outputs 0 if $x < q/2$ and 1 otherwise; this is a shifted function of the $\mathsf{Compress}_q(x, 1)$ from Kyber. They show that $\mathsf{Compress}^s_q(x) = x_{11} \oplus (\neg x_{11} \cdot x_{10} \cdot x_9 \cdot (x_8 \oplus (\neg x_8 \cdot x_7)))$, where $x_i$ is the $i$-th bit of $x$. Therefore they proceed by first converting from arithmetic masking modulo $q$ to Boolean masking, using [BBE+18]. Then the above function can be computed with high-order secure implementations of the And and Xor gadgets. In the full version of our paper [CGMZ21], we describe a slightly simpler approach, still based on the arithmetic modulo $q$ to Boolean masking conversion from [BBE+18], with asymptotic complexity $O(n^2 \cdot \log \log q)$ instead of $O(n^2 \cdot \log q)$ in [BBE+18]. The authors also describe a high-order secure polynomial comparison algorithm, which compares uncompressed masked polynomials with compressed public polynomials, so that the ciphertext compression from Kyber does not need to be explicitly masked.
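The identity $x \oplus y = x + y - 2xy$ for bits, combined with an ISW-style multiplication over $\mathbb{Z}_q$, can be sketched as follows. This is a functional illustration only: the helper names (`arith_share`, `isw_mul`, `xor_arith`) are hypothetical, the order of operations is not arranged for probing security, and it is not the exact algorithm of [SPOG19].

```python
import secrets

def arith_share(x, n, q):
    """Split x into n arithmetic shares modulo q."""
    s = [secrets.randbelow(q) for _ in range(n - 1)]
    s.append((x - sum(s)) % q)
    return s

def isw_mul(a, b, q):
    """ISW-style product of two n-sharings modulo q: sum(c) = sum(a) * sum(b) mod q."""
    n = len(a)
    c = [(a[i] * b[i]) % q for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbelow(q)          # fresh randomness for each pair (i, j)
            c[i] = (c[i] + r) % q
            c[j] = (c[j] + a[i] * b[j] + a[j] * b[i] - r) % q
    return c

def xor_arith(a, b, q):
    """Arithmetic sharing of x xor y from arithmetic sharings of bits x and y."""
    p = isw_mul(a, b, q)                      # sharing of x * y, O(n^2) operations
    return [(a[i] + b[i] - 2 * p[i]) % q for i in range(len(a))]

q, n = 3329, 4
for x in (0, 1):
    for y in (0, 1):
        z = xor_arith(arith_share(x, n, q), arith_share(y, n, q), q)
        assert sum(z) % q == x ^ y
```

The quadratic cost comes entirely from the cross terms $a_i b_j$ of the multiplication; the final combination is linear in $n$.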

Security definitions
We recall below the $t$-NI and $t$-SNI security notions introduced in [BBD+16]. We consider a gadget taking as input a single $n$-tuple $(x_i)_{1 \le i \le n}$ of shares, and outputting a single $n$-tuple $(y_i)_{1 \le i \le n}$. Given a subset $I \subset [1, n]$, we denote by $x_{|I}$ all elements $x_i$ such that $i \in I$.

Definition 1 ($t$-NI security [BBD+16]). Let $G$ be a gadget taking as input $(x_i)_{1 \le i \le n}$ and outputting the vector $(y_i)_{1 \le i \le n}$. The gadget $G$ is said to be $t$-NI secure if for any set of $t_1 \le t$ intermediate variables, there exists a subset $I$ of input indices with $|I| \le t_1$, such that the $t_1$ intermediate variables can be perfectly simulated from $x_{|I}$.

Definition 2 ($t$-SNI security [BBD+16]). Let $G$ be a gadget taking as input the $n$ shares $(x_i)_{1 \le i \le n}$, and outputting $n$ shares $(z_i)_{1 \le i \le n}$. The gadget $G$ is said to be $t$-SNI secure if for any set of $t_1$ probed intermediate variables and any subset $O$ of output indices, such that $t_1 + |O| \le t$, there exists a subset $I$ of input indices that satisfies $|I| \le t_1$, such that the $t_1$ intermediate variables and the output variables $z_{|O}$ can be perfectly simulated from $x_{|I}$.

The main benefit of the $t$-SNI security definition is that it allows easy composition of gadgets [BBD+16]. By proving the $t$-SNI property of individual gadgets, we obtain that the full circuit is secure against $t$ probes, using $n = t + 1$ shares. In this paper we prove the $t$-NI or $t$-SNI property of all our gadgets. Note that a $t$-NI gadget is easily converted into $t$-SNI by applying a $t$-SNI mask refreshing as output (see [BBD+16]), making it suitable for composition with other gadgets in a larger circuit. For our generic high-order table-based conversion algorithm (Section 3), the $t$-SNI security proof is essentially the same as in [CRZ18]. For our more specialized gadgets, the $t$-NI or $t$-SNI properties follow almost directly from the $t$-SNI of our generic conversion algorithm.

Generic high-order table-based conversion algorithm
In this section we introduce our new generic high-order table-based conversion algorithm, as an extension of the table recomputation countermeasure from [Cor14]. We consider two additive groups $G$ and $H$ and a function $f : G \to H$. Our algorithm takes as input $n$ shares $x_1, \ldots, x_n \in G$ and outputs $n$ shares $y_1, \ldots, y_n \in H$ such that:
$$y_1 + \cdots + y_n = f(x_1 + \cdots + x_n).$$
We stress that the function $f$ does not need to have any special property, except being efficiently computable. In particular it need not be a group homomorphism, as in general the groups $G$ and $H$ will not be homomorphic. Note that the high-order SBox computation algorithm from [Cor14] is a particular case with $G = H = \{0,1\}^k$ and $f(x) = S(x)$.
The algorithm consists in progressively shifting a randomized table $T$, using the input shares $x_1, \ldots, x_{n-1}$ for the successive shifts. The randomized table $T$ has $|G|$ rows, and each row is a vector of $n$ shares, which encodes over $H$ the function $f(x)$, but progressively shifted by $x_1, \ldots, x_{n-1} \in G$. Eventually one reads the table at index $x_n$, which gives an $n$-sharing $(y_i)$ over $H$ of $f(x_1 + \cdots + x_n)$ as required. Between every shift, the $n$ shares of every row are refreshed using the same mask refreshing as in [RP10], but over the group $H$.
As we will see in more detail in Section 4, for a Boolean to arithmetic conversion algorithm, one will take $G = \{0,1\}^k$ and $H = \mathbb{Z}_{2^k}$. Then by identifying $k$-bit strings and integers modulo $2^k$ and taking $f$ the identity function, we obtain $y_1 + \cdots + y_n \bmod 2^k = x_1 \oplus \cdots \oplus x_n$ as required. Similarly, for an arithmetic to Boolean conversion, one takes $G = \mathbb{Z}_{2^k}$ and $H = \{0,1\}^k$ and obtains $y_1 \oplus \cdots \oplus y_n = x_1 + \cdots + x_n \bmod 2^k$ as required; see Section 6 for more details.
We provide a pseudocode description in Algorithm 1 above. The algorithm uses two temporary tables $T$ and $T'$ in RAM, with $|G|$ rows, where each row contains a vector of $n$ elements in $H$. The table $T$ is initialized at Line 1 with $T(u) \leftarrow (f(u), 0, \ldots, 0) \in H^n$. Given an encoding $v = (v_1, \ldots, v_n)$ with $n$ shares in $H$, the encoded element in $H$ is the sum $v_1 + \cdots + v_n$. This implies that initially each row $T(u)$ encodes $f(u)$, for all rows $u \in G$. For the first index $i = 1$, the table is shifted at Line 3 by $x_1$ into $T'$, so that each row $T'(u)$ encodes $f(u + x_1)$ for all $u \in G$. Note that the shift is performed according to the group law of $G$.
The $\mathsf{Refresh}_H$ algorithm is the same as in [RP10] and [Cor14], except that we work in the group $H$ instead of $\{0,1\}^k$. Note that the $\mathsf{Refresh}_H$ algorithm is not SNI; the only required property is that any subset of $n - 1$ shares is uniformly and independently distributed.
Complexity. We assume that a group operation in $G$ and $H$ takes unit time, as well as randomness generation and table transfer. For $n$ shares, the number of operations of $\mathsf{Refresh}_H$ is $3n - 3$. The time complexity of Algorithm 1 is dominated by the refreshing of the $|G|$ rows of the table for each of the $n - 1$ shifts; the asymptotic complexity is therefore $O(|G| \cdot n^2)$. The memory complexity is $O(|G| \cdot n)$. The algorithm requires $(n - 1) \cdot (|G| \cdot (n - 1) + 1)$ random elements.
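The generic conversion described above can be sketched in Python as follows. This is a minimal functional sketch under stated assumptions, not a side-channel-hardened implementation: the classes `ZMod` and `XorGroup`, the functions `refresh`, `convert` and `decode`, and the exact form of the mask refreshing (adding a fresh random to the first share and subtracting it from share $i$) are our own illustrative choices. Both example groups use 0 as neutral element.

```python
import secrets
from functools import reduce

class ZMod:
    """Additive group Z_m."""
    def __init__(self, m):
        self.m, self.rand_calls = m, 0
    def elements(self): return range(self.m)
    def add(self, a, b): return (a + b) % self.m
    def sub(self, a, b): return (a - b) % self.m
    def rand(self):
        self.rand_calls += 1
        return secrets.randbelow(self.m)

class XorGroup:
    """Additive group ({0,1}^k, xor); subtraction equals addition."""
    def __init__(self, k):
        self.k, self.rand_calls = k, 0
    def elements(self): return range(1 << self.k)
    def add(self, a, b): return a ^ b
    sub = add
    def rand(self):
        self.rand_calls += 1
        return secrets.randbelow(1 << self.k)

def refresh(H, row):
    """Mask refreshing over H: n - 1 randoms, 3n - 3 unit operations."""
    row = list(row)
    for i in range(1, len(row)):
        t = H.rand()
        row[0] = H.add(row[0], t)
        row[i] = H.sub(row[i], t)
    return row

def convert(G, H, f, xs):
    """Return shares ys over H with sum_H(ys) = f(sum_G(xs))."""
    n = len(xs)
    # Row u initially encodes f(u) as (f(u), 0, ..., 0).
    T = {u: [f(u)] + [0] * (n - 1) for u in G.elements()}
    for x in xs[:-1]:
        # Shift by x: the new row u is a refreshed copy of the old row u + x.
        T = {u: refresh(H, T[G.add(u, x)]) for u in G.elements()}
    return refresh(H, T[xs[-1]])   # row u now encodes f(u + x_1 + ... + x_{n-1})

def decode(H, row):
    return reduce(H.add, row, 0)

# Boolean to arithmetic modulo q, with f the identity on k-bit values.
k, q, n = 3, 3329, 4
G, H = XorGroup(k), ZMod(q)
xs = [secrets.randbelow(1 << k) for _ in range(n)]
ys = convert(G, H, lambda u: u % q, xs)
assert decode(H, ys) == decode(G, xs) % q
assert H.rand_calls == (n - 1) * ((1 << k) * (n - 1) + 1)
```

Swapping the roles of the two groups, for instance `convert(ZMod(16), XorGroup(4), lambda u: u, xs)`, gives the arithmetic to Boolean direction with the same code, which illustrates the flexibility of the approach at the cost of a table with $|G|$ rows.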
Security. We prove that our algorithm achieves the t-SNI definition (Definition 2). One can therefore use our algorithm inside a more complex construction and achieve security against t probes with n = t + 1 shares. The proof is essentially the same as in [CRZ18] and is provided in the full version of our paper [CGMZ21].
Theorem 1 ($(n - 1)$-SNI of $\mathsf{Convert}_{G,H,f}$). For any subset $O \subset [1, n]$ and any $t_1$ intermediate variables with $|O| + t_1 < n$, the output variables $y_{|O}$ and the $t_1$ intermediate variables can be perfectly simulated from the input variables $x_{|I}$, with $|I| \le t_1$.

Table-based high-order Boolean to arithmetic conversion
In this section we consider the case of Boolean to arithmetic conversion. We first describe the straightforward application of the generic table-based conversion algorithm from $G$ to $H$ from Section 3. However the main drawback is that its complexity is $O(|G| \cdot n^2)$, where $|G|$ is the group order. With $k$-bit Boolean masking as input we have $G = \{0,1\}^k$, and the complexity is therefore $O(2^k \cdot n^2)$. Fortunately for specific groups $G$ this complexity can be reduced to $O(k \cdot n^2)$. In this section we consider the easiest case with the conversion from Boolean to arithmetic masking. We consider the other direction in Section 6.

Direct approach
We consider the straightforward application of Algorithm 1 to high-order Boolean to arithmetic conversion, which can be used for small values of k. We consider an integer q.
Note that q can be any integer, not necessarily a power of two. The (n − 1)-SNI security follows directly from Theorem 1.

Optimization of high-order Boolean to arithmetic conversion
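The main drawback of the previous generic algorithm is that its complexity is $O(2^k \cdot n^2)$, which is prohibitive for large $k$, for example $k = 32$ in HMAC-SHA1. In this section we describe a simple optimization with complexity $O(k \cdot n^2)$. It consists in converting each bit of the $k$-bit input separately and adding the results.
Assume that we must convert a Boolean masking $x = x_1 \oplus \cdots \oplus x_n \in \{0,1\}^k$ to arithmetic masking $x = y_1 + \cdots + y_n \pmod{q}$. We write the binary decomposition of each $x_i$ as
$$x_i = \sum_{j=0}^{k-1} 2^j \cdot x_i^{(j)}, \qquad x_i^{(j)} \in \{0,1\},$$
so that the $j$-th bit of $x$ is $x^{(j)} = x_1^{(j)} \oplus \cdots \oplus x_n^{(j)}$. We now perform an independent table-based Boolean to arithmetic conversion for each of the $k$ variables $x^{(j)}$. More precisely, applying Algorithm 3 on the Boolean shares $x_i^{(j)}$, we obtain $n$ arithmetic shares $y_i^{(j)}$ for each $0 \le j < k$:
$$x^{(j)} = x_1^{(j)} \oplus \cdots \oplus x_n^{(j)} = y_1^{(j)} + \cdots + y_n^{(j)} \pmod{q}.$$
This gives:
$$x = \sum_{j=0}^{k-1} 2^j \cdot x^{(j)} = \sum_{j=0}^{k-1} 2^j \sum_{i=1}^{n} y_i^{(j)} = \sum_{i=1}^{n} \sum_{j=0}^{k-1} 2^j \cdot y_i^{(j)} \pmod{q},$$
and therefore letting $y_i := \sum_{j=0}^{k-1} 2^j \cdot y_i^{(j)} \bmod q$ for all $1 \le i \le n$, we obtain $x = y_1 + \cdots + y_n \pmod{q}$ as required. The algorithm is formally described in Algorithm 5 below.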

Algorithm 5 Optimized BooleanToArithmetic (BAopti)
Input: $x_1, \ldots, x_n \in \{0,1\}^k$
Output: $y_1, \ldots, y_n \in \mathbb{Z}_q$ such that $x_1 \oplus \cdots \oplus x_n = y_1 + \cdots + y_n \bmod q$
    for $i = 1$ to $n$ do $y_i \leftarrow 0$
    for $j = 0$ to $k - 1$ do
        $(y_1^{(j)}, \ldots, y_n^{(j)}) \leftarrow$ Algorithm 3 applied to the 1-bit shares $x_1^{(j)}, \ldots, x_n^{(j)}$
        for $i = 1$ to $n$ do
            $y_i \leftarrow y_i + 2^j \cdot y_i^{(j)} \bmod q$
    end for
    return $y_1, \ldots, y_n$

Complexity. Algorithm 5 computes the sum of $k$ applications of Algorithm 3 with 1-bit input. More generally, one can group the conversions by $\ell$ bits. From the resulting operation count, one can see that it is slightly more advantageous to group by $\ell = 2$ bits. In that case, we have $N_{\mathsf{BAopti}}(k, n) \simeq 8k \cdot n^2$. The memory complexity is $O(n)$. Namely Algorithm 5 uses a table with only 2 rows ($\ell = 1$) or 4 rows ($\ell = 2$) of $n$-shared encodings, therefore our table-based approach has a small memory consumption. The algorithm requires $2k \cdot n^2$ random elements.

Proof. The $(n - 1)$-SNI property follows from the $(n - 1)$-SNI of each of the $k$ independent table-based conversions (Theorem 1). Namely the corresponding output shares $y_i^{(j)}$ are combined independently for each share index $1 \le i \le n$. Therefore we can use the same output subset $O$ for each intermediate output shares (y
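The bit-by-bit optimization of this section can be sketched in Python as follows, assuming a 2-row table for each 1-bit conversion. The helper names (`refresh_mod_q`, `bit_to_arith`, `bool_to_arith_opt`) are hypothetical, and the sketch only checks functional correctness; it makes no claim about the order of operations needed for probing security.

```python
import secrets

def refresh_mod_q(row, q):
    """Mask refreshing modulo q (n - 1 fresh randoms)."""
    row = list(row)
    for i in range(1, len(row)):
        t = secrets.randbelow(q)
        row[0] = (row[0] + t) % q
        row[i] = (row[i] - t) % q
    return row

def bit_to_arith(bits, q):
    """1-bit Boolean shares -> n arithmetic shares modulo q, using a 2-row table."""
    n = len(bits)
    T = [[u] + [0] * (n - 1) for u in (0, 1)]          # row u encodes u
    for b in bits[:-1]:
        T = [refresh_mod_q(T[u ^ b], q) for u in (0, 1)]  # shift the rows by b
    return refresh_mod_q(T[bits[-1]], q)

def bool_to_arith_opt(xs, k, q):
    """k-bit Boolean shares -> arithmetic shares modulo q, one bit at a time."""
    n = len(xs)
    ys = [0] * n
    for j in range(k):
        yj = bit_to_arith([(x >> j) & 1 for x in xs], q)
        for i in range(n):
            ys[i] = (ys[i] + (1 << j) * yj[i]) % q     # y_i += 2^j * y_i^(j)
    return ys

xs = [0xDEADBEEF, 0x12345678, 0x0BADF00D]
ys = bool_to_arith_opt(xs, k=32, q=1 << 32)
assert sum(ys) % (1 << 32) == xs[0] ^ xs[1] ^ xs[2]
```

Each of the $k$ conversions touches a table with 2 rows only, which is where the $O(k \cdot n^2)$ running time and the $O(n)$ memory consumption come from.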

Comparison with existing techniques
$k$-bit Boolean to arithmetic modulo $2^k$ conversion. The $k$-bit Boolean to arithmetic modulo $2^k$ conversion is the classical case. We use $k = 32$, as for example in HMAC-SHA1. As shown in Table 2, our operation count is comparable to [CGV14] and [SPOG19]. For small orders $t$, it is, as for [CGV14] and [SPOG19], much less efficient than [BCZ18]. We refer to the full version of our paper [CGMZ21] for the operation count of [Gou01, BCZ18, CGV14, SPOG19].

1-bit Boolean to arithmetic modulo $2^k$ conversion. The 1-bit Boolean to arithmetic conversion is useful in the context of ring-LWE encryption. Here we use $k = 13$, since this corresponds to the binomial sampling for Saber, which can be written as a sum modulo $2^k$ of 1-bit Boolean to arithmetic modulo $2^k$ conversions, as in [SPOG19]. We see in Table 3 that our operation count is comparable to [SPOG19], both methods having complexity $O(n^2)$. Our algorithm is an order of magnitude faster than [CGV14], which has complexity $O(k \cdot n^2)$. Namely the approach in [CGV14] requires performing an arithmetic to Boolean conversion first, which has complexity $O(k \cdot n^2)$, so one cannot really take advantage of the 1-bit Boolean masking as input.

1-bit Boolean to arithmetic modulo $q$ conversion. We use $q = 3329$, as this corresponds to the encryption of Kyber, and to the binomial sampling of Kyber. For [BBE+18], we must use a word size $k$ such that $2q < 2^k$, so we take $k = 13$. As previously, our complexity is comparable to [SPOG19], and more than an order of magnitude faster than [BBE+18].

Table-based shift of arithmetic masking
In this section we consider the table-based computation of a right shift over arithmetic shares. This can be used directly in Saber, and this will be used as a subroutine for the arithmetic to Boolean masking conversion (Section 6). We consider a parameter $1 \le \ell < k$. We consider the function $f(x)$ performing a right shift of a $k$-bit integer $x$ by $\ell$ bits, more precisely
$$f : \mathbb{Z}_{2^k} \to \mathbb{Z}_{2^{k-\ell}}, \qquad f(x) = \lfloor x / 2^\ell \rfloor \bmod 2^{k-\ell}.$$
Our goal is to compute this right shift with arithmetic shares. More precisely, given as input $z = z_1 + \cdots + z_n \pmod{2^k}$, we want to obtain arithmetic shares $a_1, \ldots, a_n \in \mathbb{Z}_{2^{k-\ell}}$ such that
$$a_1 + \cdots + a_n = f(z_1 + \cdots + z_n) \pmod{2^{k-\ell}}.$$
This right shift will be used for our table-based conversion from arithmetic to Boolean masking with complexity $O(k \cdot n^2)$. Namely we will perform a sequence of $k/\ell$ right shifts by $\ell$ bits, each time converting a block of $\ell$ bits from arithmetic to Boolean masking. The goal of the right shift is to propagate the carry from one block to the next; this is a natural generalization of the carry propagation technique used in [CT03] for first-order table-based conversion.
If we are only interested in doing a right shift by $\ell$ bits (as in Saber), a basic approach using [CGV14] consists in first performing an arithmetic to Boolean conversion, then in doing an easy logical right shift by $\ell$ bits of the Boolean shares, and eventually in converting back the result to arithmetic modulo $2^{k-\ell}$. The complexity is then $O(k \cdot n^2)$, and therefore independent of $\ell$.
In the following, we describe a table-based approach with complexity $O(\ell \cdot n^2)$. Therefore we expect the approach to be more efficient than [CGV14] for small values of $\ell$. We actually describe a first technique with complexity $O(2^\ell \cdot n^3)$, and a second technique with complexity $O(2^{2\ell} \cdot n^2)$. To obtain a linear complexity in $\ell$, in both cases we can perform a sequence of $\ell$ shifts by $\ell' = 1$ bit each. The first technique then has complexity $O(\ell \cdot 2^{\ell'} \cdot n^3) = O(\ell \cdot n^3)$, while the second technique has complexity $O(\ell \cdot 2^{2\ell'} \cdot n^2) = O(\ell \cdot n^2)$. Because of a smaller constant in the $O$, we expect the first technique to be more efficient for small $n$.

First approach with complexity O(2 · n 3 )
We consider the function f : Z 2 k → Z 2 k− defined previously with f (x) = x/2 mod 2 k− , which corresponds to the k − most significant bits of x. We consider an arithmetic masking z = z 1 + · · · + z n (mod 2 k ) as input. Our goal is to obtain an arithmetic sharing This gives: and therefore: The previous equation shows that to compute f (z) (which corresponds to the k − most significant bits of z), we must compute the carry resulting from the addition of the -bit shares For this we apply our generic Algorithm 1 with inputs x 1 , · · · , x n , G = Z 2 k , H = Z 2 k− and f , and we obtain an arithmetic masking of the resulting carry: Combining (1) and (2), this gives: Therefore we have obtained an arithmetic sharing of the k − high-order bits of z. For a naive implementation of Algorithm 1, the complexity of this step would be O(|G| · n 2 ) = O(2 k · n 2 ). Therefore there would be no advantage compared to a generic table-based computation of the function f . However we note that since 0 ≤ x i < 2 for all i, we actually have 0 ≤ n i=1 x i ≤ n · (2 − 1) in (2). Therefore, when applying Algorithm 1, we do not need to store and randomize a full table with 2 k rows, as we can work with a much smaller table with B = n · (2 − 1) + 1 rows only. Thanks to this optimization the complexity becomes O(B · n 2 ) = O(2 · n 3 ). Moreover the table does not have to be cyclically shifted, only translated by x i for each 1 ≤ i ≤ n − 1; this implies that a single table in memory is sufficient. Our method is described below in Algorithm 6.
Algorithm 6 Shift1
Input: $k \in \mathbb{N}^+$, $1 \le \ell < k$ and $z_1, \ldots, z_n \in \mathbb{Z}_{2^k}$
Output: $a_1, \ldots, a_n \in \mathbb{Z}_{2^{k-\ell}}$ such that $a_1 + \cdots + a_n = f(z_1 + \cdots + z_n) \pmod{2^{k-\ell}}$

Complexity. Writing $N_{s1}(\ell, n)$ for the operation count of Algorithm 6, the algorithm requires $N_{s1}(\ell, n) \simeq 2^{\ell+1} \cdot n^3$ operations, neglecting low-order terms. By performing a sequence of $\ell$ shifts of 1 bit each, the number of operations is $\ell \cdot N_{s1}(1, n) \simeq 2\ell \cdot n^3$, neglecting low-order terms. The memory complexity is $O(n^2)$, since the table has $O(n)$ rows. More precisely, with $\ell = 1$, the table has $n$ rows of $n$-shared encodings, which corresponds to $n^2$ values in memory. In the next section we describe an alternative technique with memory complexity $O(n)$ only. The number of random elements is $(n-1) \cdot n \cdot (n+1)/2$.

Security.
The algorithm only achieves the (n − 1)-NI property. To achieve the stronger (n − 1)-SNI property, one can apply an (n − 1)-SNI mask refreshing algorithm to the output (see [BBD + 16]). Such mask refreshing has complexity $O(n^2)$ only, so this does not change the asymptotic complexity $O(\ell \cdot n^3)$.
Theorem 3 ((n − 1)-NI of Shift1). Any set of t 1 ≤ n − 1 intermediate variables can be perfectly simulated from the input variables z |I , with |I| ≤ t 1 .
Proof. The table-based conversion algorithm up to Line 9 is the same as the (n − 1)-SNI Algorithm 1, except that we are only performing a fraction of the computation. This implies that the adversary can only probe a subset of the variables, and therefore the algorithm remains (n − 1)-SNI, and therefore (n − 1)-NI. The global algorithm combines at Line 10 those output shares with a right shift of the input shares. Therefore the algorithm is (n − 1)-NI.

Second approach with complexity $O(2^{2\ell} \cdot n^2)$
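We consider again the function $f : \mathbb{Z}_{2^k} \to \mathbb{Z}_{2^{k-\ell}}$ defined previously with $f(x) = \lfloor x/2^\ell \rfloor \bmod 2^{k-\ell}$, for some parameter $1 \le \ell < k$. We consider two integers $z_1, z_2 \in \mathbb{Z}_{2^k}$, and we write:

$$z_1 = y_1 \cdot 2^\ell + x_1, \qquad z_2 = y_2 \cdot 2^\ell + x_2, \qquad 0 \le x_1, x_2 < 2^\ell$$

To compute the carry in the sum of $x_1$ and $x_2$, we consider the function $g : (\mathbb{Z}_{2^\ell})^2 \to \mathbb{Z}_{2^{k-\ell}}$, with:

$$g(x_1, x_2) = \left\lfloor \frac{x_1 + x_2}{2^\ell} \right\rfloor$$

By propagating the carry from the sum of $x_1$ and $x_2$, we get:

$$f(z_1 + z_2) = y_1 + y_2 + g(x_1, x_2) \pmod{2^{k-\ell}} \qquad (3)$$

We consider an arithmetic masking $z = z_1 + \cdots + z_n \pmod{2^k}$. For all $1 \le i \le n$, we write as previously $z_i = y_i \cdot 2^\ell + x_i$ with $0 \le x_i < 2^\ell$. Equation (3) can be generalized to:

$$f(z) = \sum_{i=1}^{n} y_i + \sum_{j=1}^{n-1} g\Big(\sum_{i=1}^{j} x_i \bmod 2^\ell,\; x_{j+1}\Big) \pmod{2^{k-\ell}} \qquad (4)$$

Now applying Algorithm 1 with $G = (\mathbb{Z}_{2^\ell})^2$, $H = \mathbb{Z}_{2^{k-\ell}}$ and $g$, we can obtain the following arithmetic masking for all $1 \le j \le n-1$:

$$c_{j,1} + \cdots + c_{j,n} = g\Big(\sum_{i=1}^{j} x_i,\; x_{j+1}\Big) \pmod{2^{k-\ell}} \qquad (5)$$

Namely the input of $g$ can be computed as a sum over the additive group $\mathbb{Z}_{2^\ell} \times \mathbb{Z}_{2^\ell}$:

$$\Big(\sum_{i=1}^{j} x_i,\; x_{j+1}\Big) = \sum_{i=1}^{j} (x_i, 0) + (0, x_{j+1})$$

Eventually by combining (4) and (5) we obtain:

$$f(z) = \sum_{i=1}^{n} \Big(y_i + \sum_{j=1}^{n-1} c_{j,i}\Big) \pmod{2^{k-\ell}}$$

Therefore as previously we have obtained an arithmetic sharing of the $k-\ell$ high-order bits of $z$, which means that we can perform a shift by $\ell$ bits over the arithmetic shares of $z$.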
For each $1 \le j < n$, Equation (5) can be evaluated using Algorithm 1 with complexity $O(|G| \cdot j \cdot n) = O(2^{2\ell} \cdot j \cdot n)$; namely there are $j+1$ input shares instead of $n$. Therefore the total complexity of the first step would be $O(2^{2\ell} \cdot n^3)$, which is still cubic in $n$ as previously. However it is possible to evaluate (5) more cleverly. Namely we can keep the table randomization obtained up to $\sum_{i=1}^{j} (x_i, 0)$ when computing the new table randomization up to $\sum_{i=1}^{j+1} (x_i, 0)$. The complexity then becomes $O(2^{2\ell} \cdot n^2)$. The algorithm is described formally below in Algorithm 7.
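The carry propagation used by the second approach can be illustrated with a short Python sketch. It is an unmasked functional model, not Algorithm 7: it keeps a running sum of the low parts modulo 2^ell and adds one carry per new share, which mirrors how the same running state is reused from one step to the next. The function names are hypothetical.

```python
def g(x1, x2, ell):
    """Carry (0 or 1) produced by adding two ell-bit values."""
    return (x1 + x2) >> ell

def shift_via_chained_carries(shares, k, ell):
    """Compute floor(z / 2^ell) mod 2^(k-ell) with one carry per added share.

    acc holds x_1 + ... + x_j mod 2^ell and is reused at step j + 1, so each
    new share costs a single evaluation of g. Everything is in the clear.
    """
    low_mask = (1 << ell) - 1
    acc = shares[0] & low_mask
    total = shares[0] >> ell
    for z in shares[1:]:
        x = z & low_mask
        total += (z >> ell) + g(acc, x, ell)
        acc = (acc + x) & low_mask
    return total % (1 << (k - ell))
```

The sum of the individual carries equals the carry of the full sum of the low parts, which is why the result matches a direct shift of the recombined value.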
Security. The Shift2 algorithm only achieves the (n − 1)-NI property. To achieve the stronger (n − 1)-SNI property, as previously one can apply a (n − 1)-SNI mask refreshing algorithm as output, without changing the asymptotic complexity.
Theorem 4 ((n − 1)-NI of Shift2). Any set of t 1 ≤ n − 1 intermediate variables can be perfectly simulated from the input variables z |I , with |I| ≤ t 1 .
Proof. The security proof is very similar to the security proof of Theorem 1. It is easy to see that the computation of the shares c i at Line 6 achieves the (n − 1)-SNI property. Namely either a variable is probed between lines 4 and 5 and we include i ∈ I to get the knowledge of x i from z i , or no variable is probed and we can perfectly simulate any proper subset of shares at Line 5, thanks to the mask refreshing. The same holds at Line 6 with the knowledge of x i+1 .
Therefore the computation of the shares c i at Line 6 also achieves the weaker (n − 1)-NI property, and as in the proof of Theorem 3, the combination of shares computed at Line 7 remains (n − 1)-NI.

Comparison with existing technique
For the shift by $\ell$ bits of an arithmetic masking modulo $2^k$, we perform a concrete comparison between the $O(k \cdot n^2)$ method using [CGV14] and our $O(\ell \cdot n^2)$ method; see Table 5 below. As explained previously, when using [CGV14] (or [Gou01] at first-order), we first perform an arithmetic to Boolean conversion, then a right shift by $\ell$ bits of the Boolean shares, and eventually a Boolean to arithmetic conversion modulo $2^{k-\ell}$. We use $k = 13$ and $\ell = 3$ as in Saber. We see that our table-based algorithm is more efficient. In particular, Algorithm 6 with complexity $O(\ell \cdot n^3)$ is more efficient for small orders (up to $t = 6$), while Algorithm 7 with complexity $O(\ell \cdot n^2)$ is more efficient for high orders.

We consider the direct application of Algorithm 1 to high-order arithmetic to Boolean conversion. Given $x_1, \ldots, x_n \in \mathbb{Z}_q$ as input, we obtain $y_1, \ldots, y_n \in \{0, 1\}^k$ as output, with $x_1 + \cdots + x_n \bmod q = y_1 \oplus \cdots \oplus y_n$. For this we have to assume that $q \le 2^k$, since the sum $x_1 + \cdots + x_n \bmod q$ needs at least $\lceil \log_2 q \rceil$ bits for its representation. We provide the pseudocode description below.

Optimization for arithmetic modulo 2 k with secure shift
The main drawback of the previous generic algorithm is that its complexity is O(q · n 2 ), which is prohibitive for large q. In this section we describe an optimization for q = 2 k with complexity O(k · n 2 ) instead of O(2 k · n 2 ). The technique is based on the secure computation of the right shift from Section 5. We can use either Shift1 (Algorithm 6), with complexity O(k · n 3 ), or Shift2 (Algorithm 7), with complexity O(k · n 2 ).
Assume that we are given as input $z = z_1 + \cdots + z_n \pmod{2^k}$ and we must compute $s_1, \ldots, s_n \in \{0, 1\}^k$ such that $s_1 \oplus \cdots \oplus s_n = z_1 + \cdots + z_n \pmod{2^k}$. For this we define a parameter $1 \le \ell < k$, and using one of the two Shift algorithms from Section 5, given as input the shares $z_1, \ldots, z_n \in \mathbb{Z}_{2^k}$, we obtain arithmetic shares $a_1, \ldots, a_n \in \mathbb{Z}_{2^{k-\ell}}$ such that:

$$a_1 + \cdots + a_n = \lfloor z/2^\ell \rfloor \pmod{2^{k-\ell}}$$

By definition we have $z = \lfloor z/2^\ell \rfloor \cdot 2^\ell + (z \bmod 2^\ell)$. Therefore letting $x_i = z_i \bmod 2^\ell$ for all $1 \le i \le n$ we can write:

$$z = (a_1 + \cdots + a_n \bmod 2^{k-\ell}) \cdot 2^\ell + (x_1 + \cdots + x_n \bmod 2^\ell) \qquad (6)$$

Equation (6) shows that we have actually obtained two independent arithmetic sharings: an arithmetic sharing $(a_i)_{1 \le i \le n}$ of the $k-\ell$ high-order bits of $z$, and an arithmetic sharing $(x_i)_{1 \le i \le n}$ of the $\ell$ low-order bits of $z$. One can then directly convert the arithmetic sharing $(x_i)_{1 \le i \le n}$ into Boolean masking using Algorithm 8 from Section 6.1, with complexity $O(2^\ell \cdot n^2)$. This gives a Boolean masking of the $\ell$ low-order bits of $z$. The process can be applied recursively with the $k-\ell$ high-order bits of $z$, starting now from the arithmetic sharing $(a_i)_{1 \le i \le n}$. Eventually one obtains a full Boolean masking of $z$. The algorithm is formally described in Algorithm 10 below.
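The recursive structure of this optimization (convert the low-order chunk, shift, repeat on the high-order part) can be sketched as follows. This Python model works in the clear and only illustrates the data flow: the low chunk and the carry are recombined openly, whereas the masked algorithm keeps them in shared form. The function name `chunked_conversion` and the handling of a last chunk shorter than `ell` are assumptions made for illustration.

```python
def chunked_conversion(shares, k, ell):
    """Recover z = sum(shares) mod 2^k chunk by chunk (unmasked model).

    At each round, the low `step` bits of the shares give the next chunk of
    z, and the shares are replaced by an arithmetic sharing of the remaining
    high-order bits modulo 2^(bits - step).
    """
    result, pos = 0, 0
    cur, bits = list(shares), k
    while bits > 0:
        step = min(ell, bits)
        low_mask = (1 << step) - 1
        xs = [s & low_mask for s in cur]
        result |= (sum(xs) & low_mask) << pos   # low-order chunk of z
        carry = sum(xs) >> step                 # carry into the high part
        bits -= step
        pos += step
        cur = [s >> step for s in cur]          # shares y_i of the high part
        cur[0] = (cur[0] + carry) % (1 << bits)
    return result
```

With `k = 13` and `ell = 3`, the loop runs five rounds (four 3-bit chunks and one final 1-bit chunk).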
Variant for $\ell = 1$. When taking $\ell = 1$ in Algorithm ABopti above, the ArithmeticToBoolean algorithm at Line 7 becomes a simple SNI mask refreshing, as arithmetic masking modulo $2^\ell = 2$ is equivalent to Boolean masking. We can therefore remove this step. In that case, the conversion algorithm becomes NI only, instead of SNI. More precisely, we obtain the following iterative algorithm.
The proof of the following theorem is similar to the proof of Theorem 5 and is therefore omitted.
Theorem 6 ((n − 1)-NI of ABoptiNI). Any t 1 intermediate variables can be perfectly simulated from the input variables z |I , with |I| ≤ t 1 .

Optimization with table in registers
We describe an optimization of Algorithm 8, where the j-th column of the table is stored in a single register R j for 1 ≤ j ≤ n. The cyclic shift of the rows of the table by input share x i then corresponds to a simple rotation of each register R j . In the following we consider the arithmetic to Boolean conversion with k bits as input and 1 bit as output, as will be used in Section 8.2 for the IND-CPA decryption of lattice-based encryption.
Since we must store every column of the table with 2 k rows in a single register, each register must have 2 k bits. We denote by R j [u] the u-th bit of register R j , for 0 ≤ u < 2 k and 1 ≤ j ≤ n. Then Line 1 of Algorithm 8 becomes R 1 [u] = f (u) for 0 ≤ u < 2 k , and R j = 0 for 2 ≤ j ≤ n. The rotation of the table at Line 3 becomes a rotation of all registers R j by x i positions to the right. The refreshing of the rows of the table at Line 4 becomes a mask refreshing of the shares (R 1 , . . . , R n ) with 2 k -bit random elements. Eventually we must read and refresh the row x n of the table (Line 6 of Algorithm 8), so we simply read the x n -th bit of each register R j . We refer to Algorithm 12 for a formal description. We denote by ROR[a](R) the cyclic rotation of a 2 k -bit register R by a bits to the right.
Obviously this optimization can only work for small values of k. In the comparison with existing techniques (sections 6.4 and 8.5), we use the following more realistic estimate of operation count, assuming a 32-bit processor. We assume that a register operation (or random generation) takes 1 operation for 32-bit (k = 5), and more generally 2 k−5 operations for 2 k bits, for k ≥ 5. The time complexity then becomes N ABreg (n, k) = 2 k−5 · (4n 2 − 3n). The number of 32-bit random elements is 2 k−5 · (n − 1) 2 + n − 1 for k ≥ 5.
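The register-based variant can be modelled with Python integers standing in for 2^k-bit registers. This is a functional sketch under stated assumptions, not Algorithm 12 itself: bit u of a register is taken to be its u-th least significant bit, the refresh is a simple XOR-based one, and the names `ror` and `convert_with_registers` are hypothetical. It illustrates that rotating every register by each input share, then reading bit x_n, yields Boolean shares of f(x_1 + ... + x_n mod 2^k).

```python
import secrets

def ror(r, a, width):
    """Cyclic rotation of a width-bit value r by a positions to the right."""
    a %= width
    mask = (1 << width) - 1
    return ((r >> a) | (r << (width - a))) & mask

def convert_with_registers(shares, k, f):
    """Arithmetic mod 2^k to 1-bit Boolean conversion, one register per column."""
    n, width = len(shares), 1 << k
    regs = [0] * n
    regs[0] = sum(f(u) << u for u in range(width))  # R_1[u] = f(u), R_j = 0
    for x in shares[:-1]:
        # Rotation: bit u of each register now holds the old bit (u + x) mod 2^k.
        regs = [ror(r, x, width) for r in regs]
        # Refresh the sharing with fresh 2^k-bit random values.
        for j in range(1, n):
            rnd = secrets.randbits(width)
            regs[0] ^= rnd
            regs[j] ^= rnd
    # Read bit x_n of every register, then refresh the 1-bit output shares.
    out = [(r >> shares[-1]) & 1 for r in regs]
    for j in range(1, n):
        rnd = secrets.randbits(1)
        out[0] ^= rnd
        out[j] ^= rnd
    return out
```

This model uses (n − 1)^2 register-sized random values and n − 1 random bits, which is consistent with the count of random elements stated above.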

Security.
We prove below the (n − 1)-SNI property of Algorithm 12. We stress that we do not put two shares from the same encoding into the same register. Otherwise the attacker could obtain information from multiple shares of the same encoding using a single probe on a given register, which would break the (n − 1)-SNI property.

Proof. The proof is essentially the same as the proof of Theorem 1. The only difference is that by probing a register R j , the adversary gets the full j-th column of the table, instead of a single cell only. Such a probe is simulated in the same way by putting the index j in J for every such probe.
Extensions. The technique is easily extended to arithmetic masking modulo any q as input, not only q = 2 k . In that case, one must perform two shifts for each register, instead of a single rotation for q = 2 k . Moreover, the technique is easily extended to k-bit Boolean masking as output, instead of 1-bit. In that case, one must use registers of size k · 2 k bits instead of 2 k .

Comparison with existing techniques
Arithmetic modulo $2^k$ to $k$-bit Boolean conversion. As in Section 4.3, we consider the classical case of arithmetic modulo $2^k$ to $k$-bit Boolean conversion, and we use $k = 32$ as in HMAC-SHA1. We see in Table 6 that in that case our table-based technique is less efficient than [CGV14], by a factor between 2 and 3.

Arithmetic modulo $2^k$ to 1-bit Boolean conversion, for small $k$. As we will see in Section 8.2, arithmetic modulo $2^k$ to 1-bit Boolean conversion is interesting in the context of ring-LWE IND-CPA decryption, in order to compute the threshold function $th : \mathbb{Z}_{2^k} \to \{0, 1\}$ with $th(x) = 1$ if $x \in [2^{k-2}, 3 \cdot 2^{k-2})$ and $th(x) = 0$ otherwise. Such a threshold function $th$ can be computed directly using our Algorithm 12 from the previous section, since the algorithm works for any function $f$. Alternatively, to compute $th$ with [CGV14], we write $th(x) = th'(x - 2^{k-2})$ where $th'(x) = 1$ if $x \in [0, 2^{k-1})$ and 0 otherwise. Thus, $th'(x)$ is the complement of the most significant bit of $x$. Therefore we first subtract $2^{k-2}$ from the first arithmetic share of $x$, and perform the arithmetic to Boolean conversion from [CGV14]. Finally we extract the most significant bit of each Boolean share, and complement the first share; see the full version of our paper [CGMZ21] for more details. We see in Table 7 that for $k = 6$ (see Section 8 for a motivation of this choice of $k$), we obtain a significant improvement compared to [CGV14].

Table 7: Operation count for arithmetic modulo $2^k$ to 1-bit Boolean conversion algorithms, up to security order $t = 12$, with $n = t + 1$ shares and $k = 6$.

For any positive integer $q$, we define $r' = r \bmod q$ to be the unique element $r'$ in the range $[0, q)$ such that $r' = r \pmod q$. For an even (resp. odd) positive integer $q$, we define $r' = r \bmod^{\pm} q$ to be the unique element $r'$ in the range $-q/2 < r' \le q/2$ (resp. $-(q-1)/2 \le r' \le (q-1)/2$) such that $r' = r \pmod q$. For $x \in \mathbb{Q}$, we denote by $\lfloor x \rceil$ the rounding of $x$ to the nearest integer, with ties being rounded up.
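The reduction of the threshold to a most-significant-bit extraction can be checked exhaustively for small k. The Python sketch below assumes the threshold interval that follows from the relation with the complement of the most significant bit, namely that the threshold equals 1 exactly on [2^(k-2), 3 * 2^(k-2)). The function names are hypothetical, and the Boolean shares in the last helper are assumed to come from some arithmetic to Boolean conversion.

```python
def th(x, k):
    """Threshold on Z_{2^k}: 1 on the middle half [2^(k-2), 3 * 2^(k-2))."""
    quarter = 1 << (k - 2)
    return 1 if quarter <= x < 3 * quarter else 0

def th_via_msb(x, k):
    """Same threshold, as the complement of the MSB of x - 2^(k-2) mod 2^k."""
    shifted = (x - (1 << (k - 2))) % (1 << k)
    return 1 - (shifted >> (k - 1))

def msb_shares_with_complement(bool_shares, k):
    """Extract the MSB of each Boolean share and complement the first one.

    bool_shares is assumed to be a Boolean sharing of x - 2^(k-2) mod 2^k;
    the XOR of the returned bits is then th(x).
    """
    bits = [b >> (k - 1) for b in bool_shares]
    bits[0] ^= 1
    return bits
```

Since the most significant bit is a linear function of the Boolean shares, the extraction is applied share by share without any interaction between shares.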

Ring-LWE IND-CPA encryption.
Let $R$ and $R_q$ denote the rings $\mathbb{Z}[X]/(X^d + 1)$ and $\mathbb{Z}_q[X]/(X^d + 1)$ respectively, for some $d \in \mathbb{Z}$ and an integer $q$. Let $a \in R_q$ be a public random polynomial. Let $\chi$ be a distribution outputting "small" elements in $R$, and let $s, e \leftarrow \chi$. The public key is $t = as + e \in R_q$, while the secret key is $s$. To CPA-encrypt a message $m \in R$ with binary coefficients, one computes the ciphertext $(c_1, c_2)$ where

$$c_1 = a \cdot e_1 + e_2, \qquad c_2 = t \cdot e_1 + e_3 + \lfloor q/2 \rceil \cdot m \qquad (7)$$

with $e_1, e_2, e_3 \leftarrow \chi$. To decrypt a ciphertext $(c_1, c_2)$, one first computes $u = c_2 - s \cdot c_1$, which gives:

$$u = \lfloor q/2 \rceil \cdot m + e \cdot e_1 + e_3 - s \cdot e_2$$

Since the ring elements $e, e_1, e_2, e_3$ and $s$ are small, and the message $m \in R$ has binary coefficients, we can recover $m$ by rounding. Namely, for each coefficient of the above polynomial $u$, we decode to 0 if the coefficient is closer to 0 than to $\lfloor q/2 \rceil$, and to 1 otherwise. More precisely, we decode the message $m$ as $m = th(c_2 - s \cdot c_1)$, where $th$ applies coefficient-wise the threshold function:

$$th(x) = \begin{cases} 1 & \text{if } q/4 < x < 3q/4 \\ 0 & \text{otherwise} \end{cases}$$

The distribution $\chi$ can be based on binomial sampling, which is easier to implement than the discrete Gaussian distribution [ADPS16]. More precisely, one can compute each polynomial coefficient as the difference between the Hamming weights of two random $\kappa$-bit strings, for some parameter $\kappa$.
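A toy, unmasked instance of this encryption scheme can make the decoding by rounding concrete. The Python sketch below assumes the usual ciphertext form c1 = a·e1 + e2 and c2 = t·e1 + e3 + round(q/2)·m, uses a deliberately tiny dimension D = 8, and draws errors from {-1, 0, 1} as a stand-in for the distribution χ. All names and parameters are illustrative and provide no security.

```python
import secrets

Q, D = 3329, 8  # toy parameters; D is far too small for any security

def small_poly():
    """Polynomial with coefficients in {-1, 0, 1}, standing in for chi."""
    return [secrets.randbelow(3) - 1 for _ in range(D)]

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def poly_mul(a, b):
    """Schoolbook multiplication in Z_q[X]/(X^D + 1)."""
    res = [0] * D
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < D:
                res[i + j] = (res[i + j] + ai * bj) % Q
            else:  # X^D = -1, so wrapped terms change sign
                res[i + j - D] = (res[i + j - D] - ai * bj) % Q
    return res

def keygen():
    a = [secrets.randbelow(Q) for _ in range(D)]
    s, e = small_poly(), small_poly()
    return (a, poly_add(poly_mul(a, s), e)), s

def encrypt(pk, m):
    a, t = pk
    e1, e2, e3 = small_poly(), small_poly(), small_poly()
    c1 = poly_add(poly_mul(a, e1), e2)
    encoded = [((Q + 1) // 2) * bit for bit in m]  # round(q/2) * m
    c2 = poly_add(poly_add(poly_mul(t, e1), e3), encoded)
    return c1, c2

def th(x):
    """Threshold on Z_q: 1 if q/4 < x < 3q/4, else 0."""
    return 1 if Q < 4 * x < 3 * Q else 0

def decrypt(s, ct):
    c1, c2 = ct
    u = [(x - y) % Q for x, y in zip(c2, poly_mul(s, c1))]
    return [th(x) for x in u]
```

With these toy parameters the noise term has coefficients of absolute value at most 2·D + 1 = 17, far below q/4, so decryption always succeeds.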

Module-LWE IND-CPA encryption.
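A public-key encryption scheme based on the module learning-with-errors problem (M-LWE) in module lattices [LS15] is parameterized by a ring $R_q$, a module rank $l$ and a distribution $\chi$ outputting "small" elements in $R$. In Kyber and Saber, we use $R = \mathbb{Z}[X]/(X^d + 1)$ and $R_q = \mathbb{Z}_q[X]/(X^d + 1)$, and $\chi$ is a distribution outputting polynomials with coefficients independently drawn from a centered binomial distribution of fixed parameter. We denote vectors and matrices by boldfaced variables. Let $\mathbf{s}$ and $\mathbf{e}$ be elements of $R^l$ sampled from $\chi^l$ and $\mathbf{A}$ a uniformly random element of $R_q^{l \times l}$. The public key is $\mathbf{t} = \mathbf{A} \cdot \mathbf{s} + \mathbf{e} \in R_q^l$ and the secret key is $\mathbf{s}$. To CPA-encrypt a message $m \in R$ with binary coefficients, one computes $(\mathbf{c}_1, c_2) \in R_q^l \times R_q$ such that

$$\mathbf{c}_1 = \mathbf{A}^T \cdot \mathbf{r} + \mathbf{e}_1, \qquad c_2 = \mathbf{t}^T \cdot \mathbf{r} + e_2 + \lfloor q/2 \rceil \cdot m$$

where $\mathbf{r}$ and $\mathbf{e}_1$ are sampled from $\chi^l$ and $e_2$ from $\chi$. To decrypt a ciphertext $(\mathbf{c}_1, c_2)$, one computes:

$$c_2 - \mathbf{s}^T \cdot \mathbf{c}_1 = \lfloor q/2 \rceil \cdot m + \mathbf{e}^T \cdot \mathbf{r} + e_2 - \mathbf{s}^T \cdot \mathbf{e}_1 \approx \lfloor q/2 \rceil \cdot m$$

The last approximation holds because elements sampled from $\chi$ are small. To recover the original message $m$, as previously one applies coefficient-wise the threshold function $th$.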
CCA-secure KEM. Both Kyber and Saber actually aim to construct a CCA-secure key encapsulation mechanism from their IND-CPA encryption. They both use a variant of the FO transform [FO99]. Since the details do not matter for masking, we describe a simplified version. To encrypt a random session key $K$, one first generates a random message $m$; this message is then encrypted with the IND-CPA encryption scheme using a random tape $r = H_1(m)$ derived from the message itself. The ciphertext is still $c = (c_1, c_2)$, while the session key is $K = H_2(m, c)$. To decrypt, one first recovers $m$ from $(c_1, c_2)$; one can then re-encrypt $m$ using the same randomness $r = H_1(m)$ to get a new ciphertext $c'$; one then checks that $c' = c$; in that case one outputs $K = H_2(m, c)$, and $\perp$ otherwise.

Specificities of Kyber.
Kyber instantiates the M-LWE-based encryption scheme described above with $d = 256$, a prime $q = 3329$ and a centered binomial distribution of parameters 2 or 3 [ABD + 21]. The designers also introduced a compression function for reducing the size of the ciphertext. Indeed, since the decryption has to tolerate a certain amount of error to properly recover the message, a trade-off between correctness and ciphertext size can be made by purposefully dropping some low-order bits of $(c_1, c_2)$. The compression function is defined as $\mathrm{Compress}_q(x, a) = \lfloor (2^a/q) \cdot x \rceil \bmod 2^a$ and inverted by $\mathrm{Decompress}_q(x, a) = \lfloor (q/2^a) \cdot x \rceil$. Those functions are defined on scalars, and extended to polynomials and vectors of polynomials by applying them separately on each coefficient. We also note that the compression function with $a = 1$ is used to recover the message at the end of the IND-CPA decryption. Finally, since $q$ is a prime such that $q - 1$ is divisible by $d$, for efficiency reasons, the number theoretic transform (NTT) can be used for polynomial multiplication [PG13]. Thus, the public key, private key and some part of the ciphertext are transmitted in the NTT domain. The NTT being a linear operation, it is usually not relevant for the definition of the masking scheme, so for simplicity we ignore it for the rest of the paper.
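The compression and decompression functions can be written with integer arithmetic only. The Python sketch below assumes rounding to the nearest integer with ties rounded up, as in the notation of this paper; the function names are illustrative and this is not taken from any reference implementation.

```python
Q = 3329

def compress(x, a):
    """round((2^a / q) * x) mod 2^a, computed as floor((2^(a+1) * x + q) / (2q))."""
    return (((x << (a + 1)) + Q) // (2 * Q)) % (1 << a)

def decompress(y, a):
    """round((q / 2^a) * y), computed as floor((2 * q * y + 2^a) / 2^(a+1))."""
    return (2 * Q * y + (1 << a)) >> (a + 1)
```

With `a = 1`, `compress(x, 1)` returns 1 exactly when q/4 < x < 3q/4, which is the threshold decoding used at the end of IND-CPA decryption. For larger `a`, decompressing a compressed value returns an element close to the original modulo q.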

Specificities of Saber. Saber is based on the hardness of the Module Learning With Rounding (M-LWR) problem [BMD + 21]. In a nutshell, instead of explicitly adding error terms $(e, e_1, e_2)$ sampled from the distribution $\chi$, errors are deterministically added by applying a rounding to the value. For example, if the public key is of the form $\mathbf{t} = \mathbf{A} \cdot \mathbf{s} + \mathbf{e}$ in an LWE-based scheme, the LWR equivalent will use $\mathbf{t} = \lfloor \mathbf{A} \cdot \mathbf{s} \rceil_p$, with $\lfloor \cdot \rceil_p$ a rounding function mapping $\mathbb{Z}_q$ to $\mathbb{Z}_p$ with $p < q$. In the case of Saber, both $p$ and $q$ are powers of two and the rounding function is basically a shift extracting the $\log_2(p)$ most significant bits of its input. This modulus switch from one power-of-two modulus to another is also used to compress the ciphertext, and eventually to decode the noisy message.

Masking lattice-based encryption scheme
Masking IND-CPA decryption. We consider the masking of ring-LWE decryption against side-channel attacks. The secret key $s \in R$ is initially masked with $n$ shares using $s = s_1 + \cdots + s_n \pmod q$ where $s_i \in R_q$ for all $1 \le i \le n$. Given as input a ciphertext $(c_1, c_2)$, instead of computing $u = c_2 - s \cdot c_1$ and then $m = th(u)$ coefficient-wise, in the first step we can write:

$$u = c_2 - (s_1 + \cdots + s_n) \cdot c_1 = u_1 + \cdots + u_n \pmod q$$

where $u_1 = c_2 - s_1 \cdot c_1$ and $u_i = -s_i \cdot c_1$ for all $2 \le i \le n$. Therefore we have obtained an arithmetic sharing of $u$ modulo $q$. In the second step, one must compute $n$ Boolean shares $m_i$ of the message $m = m_1 \oplus \cdots \oplus m_n$ such that

$$m_1 \oplus \cdots \oplus m_n = th(u_1 + \cdots + u_n) \qquad (8)$$

without leaking information about $u = c_2 - s \cdot c_1$ in the process. Otherwise, knowing $u$, the adversary could recover the secret key as $s = (c_2 - u)/c_1 \bmod q$. Computing the threshold function $th$ securely over the shares $u_i$ as in (8) is non-trivial, because $th$ is a non-linear function from $\mathbb{Z}_q$ to $\{0, 1\}$.

Moreover, as observed in [OSPG18], one should not eventually recombine the shares $m_i$ into $m$, since otherwise knowing $m$ the adversary can launch a CCA attack and recover the private key $s$. More precisely, under a simplified version of the FO transform and with ring-LWE encryption as in (7), in the second step the message $m$ is re-encrypted using error polynomials $(e_1, e_2, e_3) = H_1(m)$ to get a new ciphertext $c'$, and one then checks that $c' = c$ before outputting the key $K = H_2(m, c)$.
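The first step of the masked decryption is linear in the secret key, so it can be applied share by share. The following Python sketch illustrates this with scalars in Z_q standing in for ring elements, which is a simplifying assumption; the helper names `share` and `masked_u` are hypothetical.

```python
import secrets

Q = 3329

def share(value, n):
    """Split value into n arithmetic shares modulo Q (illustrative helper)."""
    shares = [secrets.randbelow(Q) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

def masked_u(c1, c2, s_shares):
    """Return shares u_i with u_1 = c2 - s_1 * c1 and u_i = -s_i * c1 for i >= 2.

    The shares sum to c2 - s * c1 modulo Q, and the secret s is never
    recombined during the computation.
    """
    u = [(-s_i * c1) % Q for s_i in s_shares]
    u[0] = (u[0] + c2) % Q
    return u
```

The non-linear part, namely applying the threshold function to the shared value, is where a dedicated conversion algorithm is needed.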

Masking
We have considered the masked implementation of Step 1 (IND-CPA decryption) in the previous paragraph. As observed previously, the message $m$ recovered at Step 1 must be kept in shared form with the shares $m_i$ obtained from (8), otherwise knowing $m = m_1 \oplus \cdots \oplus m_n$ the adversary could launch a CCA attack. Therefore at Step 2 the hash function $H_1(m)$ should be masked, and the binomial sampling for generating $(e_1, e_2, e_3) = H_1(m)$ should also be masked, which gives a masked re-encrypted ciphertext $c'$. Finally the polynomial comparison between $c'$ and $c$ should be masked, and if $c' = c$ one must return a masked session key $K = H_2(m, c)$.
Since the IND-CCA decryption algorithm combines arithmetically masked values (such as $s$, $e_1$, $e_2$, $e_3$ and $c'$) and Boolean masked values (such as $m$ and $K$), this requires conversions between Boolean and arithmetic masking. In this paper, for the masking of ring-LWE encryption, we only consider the high-order masking of IND-CPA decryption (Step 1), and the re-encryption of $m$ with the binomial sampling (Step 2). Namely our goal is to show that our table recomputation technique for high-order conversions is particularly effective in the context of ring-LWE encryption. We leave for further work the masking of the polynomial comparison (Step 3), and the description of a fully masked IND-CCA ring-LWE scheme secure against high-order attacks.
Finally, for simplicity we will focus on the high-order masking of operations performed on single elements in Z q , as for example the computation of the threshold function th : Z q → {0, 1}. Namely when dealing with polynomials as in ring-LWE scheme, and additionally with vectors and matrices of polynomials as in M-LWE and M-LWR schemes (as with Kyber and Saber), these operations are performed coefficient-wise and componentwise. Therefore the corresponding algebraic structure is irrelevant for the description of the masking scheme.

Overview
As explained in Section 7.2, for masking the ring-LWE IND-CPA decryption, we must compute a threshold function $th$ over arithmetic shares modulo $q$, with 1-bit Boolean shares as output. Namely for Kyber we must compute the threshold function $th : \mathbb{Z}_q \to \{0, 1\}$ with

$$th(x) = \begin{cases} 1 & \text{if } q/4 < x < 3q/4 \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

More precisely, given as input arithmetic shares $x_1, \ldots, x_n \in \mathbb{Z}_q$, we must compute 1-bit Boolean shares $y_1, \ldots, y_n \in \{0, 1\}$ such that

$$y_1 \oplus \cdots \oplus y_n = th(x_1 + \cdots + x_n \bmod q)$$

The computation of the threshold function for Saber is similar (see the full version of our paper [CGMZ21]).

For computing the threshold function $th$, we could apply our generic conversion algorithm from Section 3 with $G = \mathbb{Z}_q$, $H = \{0, 1\}$ and the function $f : \mathbb{Z}_q \to \{0, 1\}$ with $f = th$. In that case, the time complexity is $O(q \cdot n^2)$, and the memory consumption is $O(q \cdot n)$. Both can be prohibitive for large $q$, for example with $q = 3329$ in Kyber, or with $q = 2^{10}$ in Saber.
We describe in the following an optimized technique based on modulus switching to a smaller modulus $2^\ell$, which makes it possible to apply the fast table-based variant from Section 6.3. For this we slightly modify the decryption algorithm of Kyber, with a negligible increase in the decryption failure probability. We explain in Section 8.4 why the security proof for Kyber remains perfectly valid. Namely the IND-CCA security proof only depends on the decryption failure probability, and not on the specific decryption algorithm used. In other words, one can use any decryption algorithm, as long as the decryption failure probability remains negligible. We maintain a total decryption failure probability $\delta \le 2^{-128}$ to guarantee the same level of security as in the original scheme, against both classical attacks and quantum attacks.

Threshold arithmetic modulo q to 1-bit Boolean
Our goal is to compute the function $th : \mathbb{Z}_q \to \{0, 1\}$ given by (9), but this time we require a correct computation of $th(x)$ only for a large subset of $\mathbb{Z}_q$, not necessarily for the full $\mathbb{Z}_q$. More precisely, we require correct computation only for values of $x$ which are not too close to the thresholds $\pm q/4$, by a relative factor $\Delta$; that is, we require correct decryption for $x \in R_{q,\Delta}$, where $R_{q,\Delta} \subseteq \mathbb{Z}_q$ is the set defined in (10). This means that there will be a small subset of $\mathbb{Z}_q$ for which the computation of the function $th$ can be incorrect. We will see that for lattice-based schemes such as Kyber and Saber, the probability that $x \notin R_{q,\Delta}$ is negligible for small enough $\Delta$, and therefore the decryption error will remain negligible for these two schemes.
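Our goal is therefore to compute output shares $b_1, \ldots, b_n \in \{0, 1\}$ such that, given input shares $x_1, \ldots, x_n \in \mathbb{Z}_q$ with $x = x_1 + \cdots + x_n \in R_{q,\Delta}$, we are guaranteed to obtain a correct result, that is:

$$b_1 \oplus \cdots \oplus b_n = th(x)$$

For this our strategy is to first perform a modulus switching into an arithmetic masking modulo a smaller $2^\ell$, and then to perform the (easier) conversion from arithmetic masking modulo $2^\ell$ to Boolean masking via a threshold function $f$ over $\mathbb{Z}_{2^\ell}$. More precisely, we first perform a modulus switching of all input shares $x_i$, by computing $y_i = \lfloor x_i \cdot 2^\ell / q \rceil$ for all $1 \le i \le n$. Note that this modulus switching can be computed by writing

$$\left\lfloor \frac{x_i \cdot 2^\ell}{q} \right\rceil = \left\lfloor \frac{x_i \cdot 2^\ell}{q} + \frac{1}{2} \right\rfloor = \left\lfloor \frac{x_i \cdot 2^{\ell+1} + q}{2q} \right\rfloor$$

and therefore $y_i$ is the quotient of the Euclidean division of $x_i \cdot 2^{\ell+1} + q$ by $2q$. We obtain:

$$y_i = \frac{x_i \cdot 2^\ell}{q} + \varepsilon_i$$

where $|\varepsilon_i| \le 1/2$ for all $1 \le i \le n$. This gives:

$$\sum_{i=1}^{n} y_i = \frac{2^\ell}{q} \cdot \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} \varepsilon_i$$

Therefore we have obtained an arithmetic masking of $y \in \mathbb{Z}_{2^\ell}$ where $y = 2^\ell \cdot x/q + \varepsilon \pmod{2^\ell}$ and the error $\varepsilon \in \mathbb{R}$ is such that $|\varepsilon| \le n/2$. In the second step, we apply the generic conversion algorithm from Section 3 with $G = \mathbb{Z}_{2^\ell}$, $H = \{0, 1\}$ and the threshold function $f : \mathbb{Z}_{2^\ell} \to \{0, 1\}$. Our algorithm is formally described in Algorithm 13 below.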
Algorithm 13 Arithmetic modulo $q$ to 1-bit Boolean conversion (ThresholdAtoB)

Correctness and complexity. The following lemma proves the correctness of Algorithm 13 when the threshold function $th(x)$ is computed on $x \in R_{q,\Delta}$, under the condition $n \le 2^{\ell+1} \cdot \Delta$.
From Lemma 1, it suffices to select an intermediate modulus $2^\ell$ with

$$\ell = \lceil \log_2(n/\Delta) \rceil - 1 \qquad (12)$$

to ensure correct computation of $th(x)$ for $x \in R_{q,\Delta}$. The complexity of the arithmetic to Boolean conversion at Line 2 is therefore $O(2^\ell \cdot n^2) = O(n^3)$ using the generic conversion (Algorithm 8). Using the optimized arithmetic to Boolean conversion (Algorithm 10), the complexity becomes $O(\ell \cdot n^2) = O(n^2 \cdot \log n)$. The memory complexity remains $O(n)$. Finally, using the optimization with table in registers from Section 6.3, the complexity is $O(n^2)$ only, assuming that operations on registers of size $2^\ell$ take unit time.
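The modulus switching and the choice of the intermediate modulus can be illustrated with an unmasked Python model. The sketch assumes that the threshold over the small modulus equals 1 on [2^(ell-2), 3 * 2^(ell-2)), and it recombines the switched shares in the clear instead of using a masked table-based conversion. All function names are hypothetical.

```python
import math

def intermediate_bits(n, delta):
    """Smallest ell with n <= 2^(ell+1) * delta: ceil(log2(n / delta)) - 1."""
    return math.ceil(math.log2(n / delta)) - 1

def switch_modulus(x_shares, q, ell):
    """y_i = round(x_i * 2^ell / q) mod 2^ell, via a Euclidean division by 2q."""
    return [(((x << (ell + 1)) + q) // (2 * q)) % (1 << ell) for x in x_shares]

def f_threshold(y, ell):
    """Assumed threshold over Z_{2^ell}: 1 on the middle half of the range."""
    quarter = 1 << (ell - 2)
    return 1 if quarter <= y < 3 * quarter else 0

def threshold_model(x_shares, q, ell):
    """Unmasked model: switch each share, then apply f to the recombined sum."""
    ys = switch_modulus(x_shares, q, ell)
    return f_threshold(sum(ys) % (1 << ell), ell)
```

For example, with a relative margin of 0.02 and 3 shares, `intermediate_bits(3, 0.02)` gives 7, and `threshold_model` agrees with the threshold modulo q on inputs that stay away from q/4 and 3q/4 by more than 0.02·q.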
Security. The previous algorithm achieves the (n − 1)-SNI property, thanks to the (n − 1)-SNI property of the Convert algorithm.

Application to ring-LWE IND-CPA decryption
In this section we show how to efficiently mask the IND-CPA decryption of ring-LWE schemes. We explain how to tune the value ∆ used in the definition of R q,∆ in (10) so that the decryption error remains negligible for Kyber.
As explained in Section 7.2, for masking the ring-LWE IND-CPA decryption, the secretkey s ∈ R is initially masked with n shares using s = s 1 + · · · + s n (mod q) where s i ∈ R q for all 1 ≤ i ≤ n. Given as input a ciphertext (c 1 , c 2 ), instead of computing u = c 2 − s · c 1 and then m = th(u) coefficient-wise, in the first step we compute u 1 = c 2 − s 1 · c 1 and u i = −s i · c 1 for all 2 ≤ i ≤ n, which gives an arithmetic sharing of u = u 1 + · · · + u n ∈ R q . Therefore, in the second step, by applying Algorithm 13 coefficient-wise on the polynomial shares u i ∈ R q , we obtain n boolean shares m i of the message m = m 1 ⊕ · · · ⊕ m n such that m 1 ⊕ · · · ⊕ m n = th(u 1 + · · · + u n ), as required. To ensure a negligible decryption error, we must therefore ensure that all coefficients of u belong to the set R q,∆ considered in the previous section, except with negligible probability.
Application to Kyber. The authors of the Kyber submission provide a Python script computing a tight upper bound on the decryption error probability δ. Following [HHK17], we say that PKE = (KeyGen, Enc, Dec) is (1−δ) correct if E[max m∈M Pr[Dec(sk, Enc(pk, m)) = m]] ≥ 1 − δ, where the probability is over the randomness of Enc, and the expectation is over (pk, sk) ← KeyGen(). More precisely, for an encryption of 0, the authors compute an upper-bound on the probability that any coefficient of u = c 2 − s T · c 1 is greater than q/4 in absolute value. From the definition of the set R q,∆ in (10), it suffices to rerun the script with the bound q · (1/4 − ∆) instead, in order to obtain the new decryption failure probability. We refer to the full version of our paper [CGMZ21] for more details.
For our implementations, we choose to take $\Delta = 0.02$ for the recommended parameters of Kyber; this gives a decryption failure probability $\delta = 2^{-137}$, instead of $2^{-164}$ originally (see Table 9). We argue in Section 8.4 that Kyber remains secure with this increased decryption failure probability. We provide in Table 8 the value of $\ell$ (for registers of size $2^\ell$ bits) as a function of the number of shares $n$ for $\Delta = 0.02$, according to Condition (12).

n:  2  3  4  5  6  7  8  9  10
ℓ:  6  7  7  7  8  8  8  8  8

Moreover, we show in Table 9 that the decryption failure probability is easily decreased by modifying the compression parameters $(d_u, d_v)$, which does not affect the security analysis of Kyber. More precisely, by using the same compression parameters $(d_u, d_v) = (11, 5)$ as for Kyber1024, we obtain for $\Delta = 0.02$ a decryption error probability $\delta = 2^{-192}$, which is smaller than originally in Kyber768.

Application to Saber. For Saber, the original decryption failure probability is $2^{-136}$, and with $\Delta = 0.02$ the failure probability becomes $2^{-112}$. We can reach $\delta = 2^{-128}$ with $\Delta = 0.007$, but in that case there is no performance improvement. However, we can slightly modify the scheme parameters to reach a decryption failure probability $\delta \le 2^{-128}$, still with $\Delta = 0.02$. More precisely, we can increase the parameter $T = 2^4$ to $T = 2^6$ as in the more secure FireSaber. This makes it possible to reach a decryption failure probability $\delta = 2^{-138}$ for $\Delta = 0.02$; see Table 10. In that case, we can use the same values of $\ell$ as in Kyber as a function of the number of shares $n$ (see Table 8).

Security impact for ring-LWE IND-CCA encryption
In this section we consider the security impact of increasing the decryption failure probability δ. Namely, as illustrated in Table 9, for the Kyber768 parameters the decryption failure probability becomes δ = 2 −137 instead of δ = 2 −164 , so we must explain why the Kyber scheme remains secure. For this we follow closely the analysis from [ABD + 21, Section 5.5].
Classical security. We recall the CCA security of Kyber against classical adversaries, based on the Fujisaki-Okamoto transform, with a security bound that includes the decryption failure probability δ.
Theorem 8 (CCA security of Kyber [ABD + 21]). Suppose XOF, H, and G are random oracles. For any classical adversary A that makes at most $q_{RO}$ queries to random oracles XOF, H and G, there exist adversaries B and C of roughly the same running time as that of A such that

$$\mathsf{Adv}^{\mathrm{cca}}_{\mathrm{Kyber.CCAKEM}}(A) \le 2\,\mathsf{Adv}^{\mathrm{mlwe}}_{k+1,k,\eta}(B) + \mathsf{Adv}^{\mathrm{prf}}_{\mathrm{PRF}}(C) + 4 q_{RO} \cdot \delta$$

We note that the above security bound does not depend on the specific decryption algorithm used. This means that modifying the decryption algorithm (as we did in the previous section) does not invalidate the security proof of Kyber, as long as the decryption failure probability $\delta$ remains negligible. From the above security bound, with $\delta = 2^{-137}$, the best strategy to generate a decryption failure is to make $2^{137}$ decryption or random oracle queries. This makes a classical attack completely impractical.
Quantum security and failure boosting. In the quantum random oracle model, the security bound is non-tight and includes a term $q_{RO}^2 \cdot \delta$; see [ABD + 21, Theorem 3]. Namely in the quantum setting the search for a message $m$ provoking a decryption failure can be quadratically accelerated using Grover's algorithm. In [ABD + 21], the authors consider a failure boosting attack strategy that uses Grover's algorithm in an offline phase to search for a polynomial pair $(e_1, r)$ with a larger norm, so that it is more likely to produce a decryption error. Below we use the same reasoning to estimate the quantum complexity of the attack, with decryption failure probability $\delta = 2^{-137}$ instead of $2^{-164}$.
The polynomial pair $(e_1, r)$ is seen as a vector in $\mathbb{Z}^{1536}$ distributed as a discrete Gaussian with standard deviation $\sigma = \sqrt{\eta_1/2} = 1$. An $m$-dimensional vector $v$ under such a distribution satisfies, for any $\kappa > 1$:

$$\Pr\big[\|v\| > \kappa \sigma \sqrt{m}\big] < \kappa^{m} \cdot \exp\big(m(1 - \kappa^2)/2\big)$$

Grover's algorithm is used to search this space with a quadratic speed-up, so with complexity $\kappa^{-m/2} \cdot \exp(m(\kappa^2 - 1)/4)$. In the second step, a decryption failure occurs if $\langle z, v \rangle$ is large enough for the secret vector $z$. If $z$ is distributed as a Gaussian with standard deviation $\sigma'$, then for any $\lambda$, we have $\Pr[\langle z, v \rangle > \lambda \sigma' \|v\|] \le 2\exp(-\lambda^2/2)$. For a vector $v$ without the failure boosting, we therefore have $\delta \simeq 2\exp(-\lambda^2/2)$, which gives $\lambda \simeq 13.8$ for $\delta = 2^{-137}$. Thanks to the failure boosting, we get a $v$ whose norm is larger by a factor $\kappa$, so we can use $\lambda' = \lambda/\kappa$ instead of $\lambda$. The improved decryption failure probability after Grover's search then becomes $2\exp(-(\lambda/\kappa)^2/2)$, which gives a total complexity $\kappa^{-m/2} \cdot \exp(m(\kappa^2 - 1)/4 + (\lambda/\kappa)^2/2)$. For $\delta = 2^{-137}$, this is minimized for $\kappa = 1.1$, with total complexity $2^{124}$ (instead of $2^{150}$ for $\delta = 2^{-164}$). Therefore the attack remains completely impractical. We refer to [ABD + 21] for a discussion on the more recent attacks based on decryption failure [BS20,DRV20]; their overall running times for Kyber are no better than the above attack. In particular, the multi-target attack considered in [DGJ + 19] is prevented in Kyber by hashing the public key pk into $r$ and $e_1$.

Table 11: Operation count for arithmetic modulo $q$ to 1-bit Boolean conversion algorithms, up to security order $t = 10$, with $n = t + 1$ shares, for Kyber and Saber, with $q = 3329$ for Kyber and $q = 2^{10}$ for Saber. For Algorithm 13, we use the values of $\ell$ from Table 8.

For Saber, we explain in the full version of our paper [CGMZ21] how to compute the threshold function modulo $2^k$ based on the arithmetic to Boolean conversion algorithm from [CGV14].
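The numerical estimate for the failure boosting attack can be reproduced with a small grid search. The Python sketch below evaluates the base-2 logarithm of the total cost κ^(-m/2) · exp(m(κ² − 1)/4 + (λ/κ)²/2) for m = 1536 and searches κ over a grid with step 0.01; the grid range and the function names are assumptions made for illustration.

```python
import math

def lambda_from_delta(log2_delta):
    """Solve delta = 2 * exp(-lambda^2 / 2) for lambda, with delta = 2^log2_delta."""
    return math.sqrt(2 * (1 - log2_delta) * math.log(2))

def log2_attack_cost(kappa, m, lam):
    """log2 of kappa^(-m/2) * exp(m * (kappa^2 - 1) / 4 + (lam / kappa)^2 / 2)."""
    ln_cost = (-(m / 2) * math.log(kappa)
               + m * (kappa ** 2 - 1) / 4
               + (lam / kappa) ** 2 / 2)
    return ln_cost / math.log(2)

def best_kappa(m, lam):
    """Grid search for the kappa in [1.00, 1.50] minimizing the attack cost."""
    candidates = [1 + i / 100 for i in range(51)]
    return min(candidates, key=lambda kp: log2_attack_cost(kp, m, lam))
```

With `lam = 13.8` and `m = 1536`, the search should return a value of κ close to 1.1 and a cost of roughly 2^124, in line with the figures quoted above.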
For Kyber, we explain in the full version of our paper [CGMZ21] how to compute the threshold function modulo $q$, based on the arithmetic modulo $q$ to Boolean conversion from [BBE+18]. For our Algorithm 13, we use the register optimization (Algorithm 12) to perform the arithmetic modulo $2^\ell$ to 1-bit Boolean conversion, according to the values of $\ell$ from Table 8. As in Section 6.3, we assume that a register operation takes 1 operation for 32-bit registers ($\ell = 5$), and $2^{\ell-5}$ operations for $2^\ell$-bit registers, for $\ell \ge 5$. We see in Table 11
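As a quick sanity check of the failure-boosting estimate from the quantum security paragraph above, the optimal $\kappa$ can be worked out in closed form. The derivation below is only a sketch: it assumes the total cost $C(\kappa) = \kappa^{-m/2} \cdot \exp(m(\kappa^2-1)/4 + (\lambda/\kappa)^2/2)$ with $m = 1536$ and $\lambda = 13.8$, and all numerical values are rounded.

```latex
\begin{align*}
\ln C(\kappa) &= -\tfrac{m}{2}\ln\kappa + \tfrac{m}{4}(\kappa^2 - 1) + \tfrac{\lambda^2}{2\kappa^2}, \\
\frac{d}{d\kappa}\ln C(\kappa) = 0
  &\iff \kappa^4 - \kappa^2 - \tfrac{2\lambda^2}{m} = 0
   \iff \kappa^2 = \tfrac{1}{2}\Bigl(1 + \sqrt{1 + 8\lambda^2/m}\Bigr), \\
m = 1536,\ \lambda = 13.8
  &\implies \kappa^2 \approx 1.206, \quad \kappa \approx 1.10, \\
\ln C(1.1) &\approx -73.2 + 80.6 + 78.7 = 86.1
  \implies C(1.1) \approx 2^{86.1/\ln 2} \approx 2^{124}.
\end{align*}
```

Under these assumptions the minimum is reached close to $\kappa = 1.1$, in line with the value quoted in the text.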

Binomial sampling and masked ring-LWE re-encryption
In this section, we show that our techniques make it possible to efficiently mask the re-encryption of ring-LWE encryption schemes. As recalled in Section 7.2, under a simplified version of the FO transform for IND-CCA decryption, in the second step the message $m$ is re-encrypted using error polynomials $(e_1, e_2, e_3) = H_1(m)$ to get a new ciphertext $c'$. To encode a Boolean masked message $m \in \{0, 1\}$ as in (7), we can use our generic table-based conversion algorithm, with the function $f : \{0, 1\} \to \mathbb{Z}_q$ with $f(x) = \lfloor q/2 \rceil \cdot x \bmod q$. In that case the complexity is $O(n^2)$ as in [SPOG19]. Consider a single error $e$, which we write $e = H(m)$ for some hash function $H$; for simplicity we focus on a single component $e \in \mathbb{Z}$. The error $e$ is actually computed using binomial sampling, with $(\alpha, \beta) = H(m)$ and then $e = h(\alpha) - h(\beta)$, where $\alpha, \beta \in \{0, 1\}^k$ and $h$ is the Hamming weight function. The message $m$ is Boolean masked, and therefore the variables $\alpha$ and $\beta$ are Boolean masked, while the error $e$ must be arithmetically masked modulo $q$.
For masking the binomial sampling we must therefore mask the Hamming weight computation, with Boolean masking as input and arithmetic masking modulo $q$ as output. Our approach is similar to [SPOG19]: we start from our 1-bit Boolean to arithmetic masking modulo $q$ algorithm from Section 4.1 (with $k = 1$), and for $\alpha \in \{0, 1\}^k$, the Hamming weight of $\alpha$ is computed as the sum of $k$ independent 1-bit Boolean to arithmetic masking modulo $q$ conversions. Starting from a Boolean masked message $m = m_1 \oplus \cdots \oplus m_n$, we then obtain an arithmetically masked ciphertext with $n$ shares modulo $q$. Since our table-based approach has a similar level of efficiency as the technique from [SPOG19] (see Table 4 in Section 4.3 for a comparison), for the binomial sampling we obtain a similar level of efficiency as in [SPOG19], and an order of magnitude improvement compared to [BBE+18].
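For reference, the unmasked functions that the masked gadgets must reproduce can be written in a few lines of C. This is only an illustrative sketch: the function names are hypothetical, the modulus is fixed to the Kyber value $q = 3329$, and the constant $(q+1)/2 = 1665$ is an assumed rounding of $q/2$.

```c
#include <stdint.h>

#define Q 3329u  /* Kyber modulus, as in the text */

/* Hamming weight of the k low bits of x (k <= 32). */
static uint32_t hamming_weight(uint32_t x, int k) {
    uint32_t w = 0;
    for (int j = 0; j < k; j++) w += (x >> j) & 1u;
    return w;
}

/* Unmasked binomial sample e = h(alpha) - h(beta) mod Q. */
static uint32_t binomial_sample(uint32_t alpha, uint32_t beta, int k) {
    return (hamming_weight(alpha, k) + Q - hamming_weight(beta, k)) % Q;
}

/* Message encoding of one bit x; (Q + 1) / 2 = 1665 is an assumed
 * rounding of q/2 for this sketch. */
static uint32_t encode_bit(uint32_t x) {
    return (((Q + 1u) / 2u) * (x & 1u)) % Q;
}
```

A negative sample is represented modulo $q$: for instance $h(\alpha) = 0$ and $h(\beta) = 2$ give $q - 2$.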
In the following, we describe in more detail the technique to securely compute the Hamming weight and the binomial sampling, and we show how to perform masked IND-CPA encryption.

Masked Hamming weight computation
We consider the Hamming weight function $h : \{0, 1\}^k \to \mathbb{Z}$ where $h(x)$ is the sum over $\mathbb{Z}$ of the bits of $x$, and the function $h_q : \{0, 1\}^k \to \mathbb{Z}_q$ where this sum is computed modulo $q$, that is $h_q(x) = h(x) \bmod q$. Given as input $x_1, \ldots, x_n \in \{0, 1\}^k$, our goal is to compute arithmetic shares $a_1, \ldots, a_n \in \mathbb{Z}_q$ such that:

$$h_q(x_1 \oplus \cdots \oplus x_n) = a_1 + \cdots + a_n \pmod{q}.$$

The technique is similar to the optimized Boolean to arithmetic conversion algorithm from Section 4.2. We let $x = x_1 \oplus \cdots \oplus x_n$ and we write $x^{(j)}$ for the $j$-th bit of $x$ for $0 \le j < k$, which gives $h_q(x) = \sum_{j=0}^{k-1} x^{(j)} \pmod{q}$. We also denote by $x_i^{(j)}$ the $j$-th bit of each share $x_i$. We obtain:

$$h_q(x) = \sum_{j=0}^{k-1} \left( x_1^{(j)} \oplus \cdots \oplus x_n^{(j)} \right) \pmod{q}.$$

We now perform an independent table-based Boolean to arithmetic conversion for each of the $k$ variables $x^{(j)}$, namely we write for each $0 \le j < k$:

$$x_1^{(j)} \oplus \cdots \oplus x_n^{(j)} = y_1^{(j)} + \cdots + y_n^{(j)} \pmod{q}.$$

This gives:

$$h_q(x) = \sum_{j=0}^{k-1} \sum_{i=1}^{n} y_i^{(j)} = \sum_{i=1}^{n} \sum_{j=0}^{k-1} y_i^{(j)} \pmod{q},$$

and therefore letting $a_i := \sum_{j=0}^{k-1} y_i^{(j)} \bmod q$ for all $1 \le i \le n$, we obtain $h_q(x_1 \oplus \cdots \oplus x_n) = a_1 + \cdots + a_n \pmod{q}$ as required. The algorithm is formally described in Algorithm 14 below. The complexity of the algorithm is $O(k \cdot n^2)$. As for Algorithm 5, the $(n-1)$-SNI property of the algorithm follows from the $(n-1)$-SNI property of each of the $k$ independent table-based conversions.
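The bit-by-bit structure described above can be sketched in C as follows. This is not Algorithm 14 itself: the 1-bit conversion `bit_b2a` is a stand-in that uses the identity $u \oplus v = u + v - 2uv$ instead of the table-based conversion, it comes with no security claim, and the PRNG, the bound `MAX_SHARES` and all function names are hypothetical. The sketch only illustrates how the per-bit arithmetic shares $y_i^{(j)}$ are accumulated into the output shares $a_i$.

```c
#include <stdint.h>

#define Q 3329u          /* Kyber modulus, as in the text */
#define MAX_SHARES 16    /* illustrative bound (assumption) */

/* Hypothetical toy PRNG (xorshift32); NOT cryptographically secure. */
static uint32_t prng_state = 2463534242u;
static uint32_t rand_mod_q(void) {
    prng_state ^= prng_state << 13;
    prng_state ^= prng_state >> 17;
    prng_state ^= prng_state << 5;
    return prng_state % Q;   /* slightly biased; acceptable for a sketch */
}

/* Stand-in 1-bit Boolean -> arithmetic mod Q conversion of b[0..n-1].
 * Uses u XOR v = u + v - 2uv; this is NOT the table-based algorithm. */
static void bit_b2a(const uint8_t *b, uint32_t *y, int n) {
    y[0] = b[0];
    for (int i = 1; i < n; i++) y[i] = 0;
    for (int i = 1; i < n; i++) {
        uint32_t f = (1u + Q - 2u * b[i]) % Q;      /* 1 - 2*b[i] mod Q */
        for (int j = 0; j < n; j++) y[j] = (y[j] * f) % Q;
        y[0] = (y[0] + b[i]) % Q;
        for (int j = 1; j < n; j++) {               /* additive refresh */
            uint32_t r = rand_mod_q();
            y[0] = (y[0] + r) % Q;
            y[j] = (y[j] + Q - r) % Q;
        }
    }
}

/* x[i]: i-th Boolean share (k <= 32 bits); a[i]: i-th arithmetic share of
 * the Hamming weight modulo Q. Requires n <= MAX_SHARES. */
static void masked_hw(const uint32_t *x, uint32_t *a, int n, int k) {
    uint8_t b[MAX_SHARES];
    uint32_t y[MAX_SHARES];
    for (int i = 0; i < n; i++) a[i] = 0;
    for (int j = 0; j < k; j++) {
        for (int i = 0; i < n; i++) b[i] = (uint8_t)((x[i] >> j) & 1u);
        bit_b2a(b, y, n);                           /* shares of bit j */
        for (int i = 0; i < n; i++) a[i] = (a[i] + y[i]) % Q;
    }
}
```

For example, the Boolean shares `{0x5, 0x3, 0xD}` encode `0xB`, whose Hamming weight is 3, so the three output shares sum to 3 modulo $q$.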

Application to binomial sampling and masked IND-CPA encryption
We consider the high-order masking of ring-LWE IND-CPA encryption, using error polynomials $(e_1, e_2, e_3) = H_1(m)$ from a message $m \in R$ with binary coefficients:

$$c_1 = a \cdot e_1 + e_2, \qquad c_2 = t \cdot e_1 + e_3 + \lfloor q/2 \rceil \cdot m.$$

We are given as input a Boolean masked message $m = m_1 \oplus \cdots \oplus m_n \in R$ and we must output an arithmetically masked ciphertext modulo $q$. Applying our generic conversion algorithm with the function $f : \{0, 1\} \to \mathbb{Z}_q$ with $f(x) = \lfloor q/2 \rceil \cdot x \bmod q$ on each coefficient separately, we obtain arithmetic shares $M_1, \ldots, M_n \in R_q$ such that:

$$\lfloor q/2 \rceil \cdot m = M_1 + \cdots + M_n \pmod{q}.$$

Similarly, each component $e \in \mathbb{Z}$ of the error polynomials $e_1, e_2, e_3$ is equal to $e = h_q(\alpha) - h_q(\beta) \bmod q$, where $\alpha, \beta \in \{0, 1\}^k$ and $h_q$ is the Hamming weight function modulo $q$. Starting from $n$-shared Boolean maskings of $\alpha$ and $\beta$, we can therefore apply Algorithm 14 to generate $n$ arithmetic shares for $e$ modulo $q$. In the end, we obtain arithmetically masked error polynomials $E_{j,i} \in R_q$ for $j = 1, 2, 3$ such that

$$e_j = E_{j,1} + \cdots + E_{j,n} \pmod{q}.$$

Finally, we can compute the $n$ shares of the ciphertext, for $1 \le i \le n$:

$$c_{1,i} = a \cdot E_{1,i} + E_{2,i}, \qquad c_{2,i} = t \cdot E_{1,i} + E_{3,i} + M_i \pmod{q},$$

and we have $\sum_{i=1}^{n} c_{1,i} = c_1 \pmod{q}$ and $\sum_{i=1}^{n} c_{2,i} = c_2 \pmod{q}$ as required. Therefore we have obtained a masked ciphertext with $n$ shares modulo $q$. The complexity is $O(n^2)$ for $n$ shares.
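Once the message and the errors are arithmetically shared, the final step is linear and can be applied share by share. The C sketch below illustrates this on a toy ring $\mathbb{Z}_q[X]/(X^N + 1)$ with $N = 4$ and schoolbook multiplication; the ring degree, the `poly` type and the function names are assumptions made for illustration, not the implementation evaluated in this paper.

```c
#include <stdint.h>

#define Q 3329u   /* Kyber modulus, as in the text */
#define N 4       /* toy ring degree (assumption); Kyber uses 256 */

typedef struct { uint32_t c[N]; } poly;   /* element of Z_Q[X]/(X^N + 1) */

/* Schoolbook product in Z_Q[X]/(X^N + 1): X^N wraps around to -1. */
static poly poly_mul(const poly *a, const poly *b) {
    poly r = {{0}};
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            uint32_t p = (a->c[i] * b->c[j]) % Q;
            int k = i + j;
            if (k < N) r.c[k] = (r.c[k] + p) % Q;
            else       r.c[k - N] = (r.c[k - N] + Q - p) % Q;
        }
    }
    return r;
}

/* Coefficient-wise sum modulo Q. */
static poly poly_add(const poly *a, const poly *b) {
    poly r;
    for (int i = 0; i < N; i++) r.c[i] = (a->c[i] + b->c[i]) % Q;
    return r;
}

/* Share-wise ciphertext: c1[i] = a*E1[i] + E2[i] and
 * c2[i] = t*E1[i] + E3[i] + M[i], for each of the n shares. */
static void masked_encrypt_shares(const poly *a, const poly *t,
                                  const poly *E1, const poly *E2,
                                  const poly *E3, const poly *M,
                                  poly *c1, poly *c2, int n) {
    for (int i = 0; i < n; i++) {
        poly ae = poly_mul(a, &E1[i]);
        poly te = poly_mul(t, &E1[i]);
        poly s  = poly_add(&te, &E3[i]);
        c1[i] = poly_add(&ae, &E2[i]);
        c2[i] = poly_add(&s, &M[i]);
    }
}
```

Since every operation is linear in the shares, summing the outputs `c1[i]` and `c2[i]` modulo $q$ gives the same result as encrypting with the recombined errors and message.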

Practical implementation
We have implemented our techniques in plain C, together with those of [CGV14, BBE+18] and [SPOG19] for comparison. In the following, we provide tables containing the average cycle count for each gadget over 1 000 000 executions, running on a laptop with an Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz, for security orders 1 ≤ t ≤ 9. Randomness generation has been performed using a simple xorshift PRNG. Nevertheless, we provide in Table 16 the number of random elements needed by each of the considered techniques, so that their randomness usage can be compared fairly.

Conversion from 1-bit Boolean to arithmetic modulo q
The conversion from 1-bit Boolean to arithmetic modulo q is a core component of the IND-CPA encryption, since it is used both to encode the message m and to protect the binomial sampling (see Section 9.2). Table 12 shows a comparison between our technique depicted in Algorithm 3 (BooleanToArithmetic) and [SPOG19], without bitslicing. We only compare with [SPOG19] since according to Table 4, [BBE+18] is much less efficient. We see that we get a similar level of efficiency, as was also the case in the operation count from Table 4.

Arithmetic shift by $\ell$ bits
We have implemented the Shift1 algorithm (Algorithm 6), which requires a table of $B = n \cdot (2^\ell - 1) + 1$ rows for securely computing the carry value $0 \le c \le c_{\max}$ with $c_{\max} = \lfloor n \cdot (2^\ell - 1)/2^\ell \rfloor$. Recall that in Algorithm 6 the carry is arithmetically masked modulo $2^{k-\ell}$, see Equation (2). Therefore, in principle one cannot apply the register optimization from Section 6.3, since this optimization requires a Boolean masked output. However, we used the following trick. Instead of storing the carry $c$ directly in arithmetic masking as in Algorithm 6, we first use a Boolean masking for storing $c$. This requires $n$ Boolean masks as output, with $\log_2 c_{\max}$ bits each to represent the carry. With a Boolean masking as output, we can now apply the arithmetic to Boolean conversion with register optimization from Section 6.3, with registers of size $B \cdot \log_2 c_{\max}$ bits. Finally, we convert the carry from Boolean to arithmetic masking modulo $2^{k-\ell}$, using the BAopti algorithm (Algorithm 5). Therefore, in Algorithm 6 the arithmetic refreshes are replaced by Boolean refreshes (using the register optimization from Section 6.3), and a Boolean to arithmetic conversion is performed before Line 10. With $\ell = 1$, we can use 8-bit registers for $n = 2$, 16-bit registers for $n \le 7$, and 64-bit registers for $n \le 15$. We also considered $\ell = 3$ as in Saber, with 16-bit registers for $n = 2$, and 64-bit registers for $n = 3, 4$. For other values of $\ell$, we can iterate multiple Shift1 with $\ell = 1$.
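The register sizes quoted above can be checked with a small helper. The sketch below computes the number of table rows, the maximal carry and the total number of bits of the table for a given number of shares and shift amount; it assumes that each entry is stored on the number of bits needed to represent the values 0 to the maximal carry, which is our reading of the sizing rule. It also includes an unmasked reference computation of the carry, which recombines the low bits in the clear and is therefore only meant for testing. All function names are hypothetical.

```c
#include <stdint.h>

/* Number of table rows: n * (2^l - 1) + 1, for n shares and a shift by l bits. */
static unsigned shift_table_rows(unsigned n, unsigned l) {
    return n * ((1u << l) - 1u) + 1u;
}

/* Largest possible carry: floor(n * (2^l - 1) / 2^l). */
static unsigned shift_max_carry(unsigned n, unsigned l) {
    return (n * ((1u << l) - 1u)) >> l;
}

/* Bits needed to store the values 0..c_max (assumed counting convention). */
static unsigned bits_for(unsigned c_max) {
    unsigned b = 0;
    while (c_max) { b++; c_max >>= 1; }
    return b ? b : 1u;
}

/* Total number of bits of a register holding the whole table. */
static unsigned shift_register_bits(unsigned n, unsigned l) {
    return shift_table_rows(n, l) * bits_for(shift_max_carry(n, l));
}

/* Unmasked reference carry: floor(sum_i (x_i mod 2^l) / 2^l).
 * Insecure by construction; for checking only. */
static uint32_t reference_carry(const uint32_t *x, unsigned n, unsigned l) {
    uint32_t s = 0;
    for (unsigned i = 0; i < n; i++) s += x[i] & ((1u << l) - 1u);
    return s >> l;
}
```

With this convention, a shift by 1 bit needs 3 bits for 2 shares, 16 bits for 7 shares and 48 bits for 15 shares, while a shift by 3 bits needs 58 bits for 4 shares and no longer fits in 64 bits for 5 shares.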
We provide the implementation results in Table 13 with input arithmetic values modulo $2^k$, with $k = 13$ as in Saber. Recall that the [CGV14] technique is not sensitive to the number of shifted bits $\ell$. We see that for $\ell = 1$ and $\ell = 3$ our technique outperforms [CGV14], especially for small security orders. For $\ell = 3$, we see a significant gap in performance between order 3 and order 4. Namely, for order 3 ($n = 4$) we can store the carry in 64-bit registers as explained above, while for higher orders we simply use 3 iterations with $\ell = 1$.

Arithmetic modulo $2^k$ to k-bit Boolean conversion
We consider the classical arithmetic modulo $2^k$ to $k$-bit Boolean conversion, for small $k$, which is depicted in Algorithm 8 (ArithmeticToBoolean). This conversion is a building block of the optimized arithmetic to Boolean conversion of arbitrary size of Section 6.2. Indeed, Algorithm 10 relies on a smaller conversion for each block of bits. For small values of $k$, as previously we can use the register optimization from Section 6.3 with a register size of $k \cdot 2^k$ bits. We can work with the conversion $\mathbb{Z}_{16} \to \{0, 1\}^4$, i.e. with $k = 4$, since the table then fits in a 64-bit register. Indeed, using a uint64_t to store each share of the table, the conversion simply consists in cyclically shifting those integers and XORing them with random 64-bit values. Table 14 shows that this technique for small $k$ is more efficient than the conversion of [CGV14]. We recall, however, that it does not scale well to large $k$, since the size of the table is exponential in the number of bits converted. Namely, for large $k$, one should use our optimized algorithm from Section 6.2.
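The register-based conversion from arithmetic shares modulo 16 to 4-bit Boolean shares can be sketched as follows. The code is a simplified illustration, not the evaluated implementation: the table of the identity function is packed as sixteen 4-bit entries in one `uint64_t` per share, each arithmetic share rotates every table share, and a plain XOR refresh is applied in between. The refresh structure shown here is an assumption and carries no security claim, and the xorshift generator is a non-cryptographic placeholder.

```c
#include <stdint.h>

/* Hypothetical toy PRNG (xorshift64); NOT cryptographically secure. */
static uint64_t prng_state = 88172645463325252ull;
static uint64_t rand64(void) {
    prng_state ^= prng_state << 13;
    prng_state ^= prng_state >> 7;
    prng_state ^= prng_state << 17;
    return prng_state;
}

/* Rotate right by r bits, 0 <= r < 64. */
static uint64_t rotr64(uint64_t v, unsigned r) {
    return (v >> r) | (v << ((64u - r) & 63u));
}

/* a[0..n-1]: arithmetic shares modulo 16 of x.
 * y[0..n-1]: 4-bit Boolean shares with y[0] ^ ... ^ y[n-1] = x.
 * Requires 1 <= n <= 16. */
static void a2b_mod16(const uint8_t *a, uint8_t *y, int n) {
    uint64_t T[16];                      /* one 64-bit register per table share */
    T[0] = 0xFEDCBA9876543210ull;        /* entry u (4 bits) holds the value u */
    for (int j = 1; j < n; j++) T[j] = 0;
    for (int i = 0; i < n - 1; i++) {
        unsigned r = 4u * (a[i] & 15u);
        for (int j = 0; j < n; j++)      /* T(u) <- T(u + a[i] mod 16) */
            T[j] = rotr64(T[j], r);
        for (int j = 1; j < n; j++) {    /* Boolean refresh of the table shares */
            uint64_t m = rand64();
            T[0] ^= m;
            T[j] ^= m;
        }
    }
    unsigned s = 4u * (a[n - 1] & 15u);  /* read entry a[n-1] of every share */
    for (int j = 0; j < n; j++) y[j] = (uint8_t)((T[j] >> s) & 15u);
}
```

For instance, the arithmetic shares `{7, 12, 5}` encode 24 mod 16 = 8, and the XOR of the three output shares equals 8 whatever random masks are drawn.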

Threshold decryption for Kyber and Saber
We have implemented the threshold decryption function th for Kyber and Saber, which is a core component of the IND-CPA decryption. For Kyber we have used the approach described in Section 8.2 with Algorithm 13, where we first perform a modulus switching to a smaller modulus $2^\ell$, and then compute the threshold function $f : \mathbb{Z}_{2^\ell} \to \{0, 1\}$ from arithmetic to Boolean masking. We see in Table 8 that the function must be computed for $\ell = 6, 7, 8$ for a number of shares $\le 11$. For $\ell = 6$, we can therefore use the register optimization from Section 6.3, and the table with $2^\ell = 64$ rows can fit in a physical register, or at least be smoothly managed by the compiler using the appropriate data type. The cases $\ell = 7, 8$ are somewhat trickier. Even if some architectures might support 128-bit or 256-bit integers, these widths are less common. In those cases, we use two or four 64-bit registers and simulate the shift on 128 or 256 bits with 64-bit instructions. The results in Table 15 show that for both Saber and Kyber our approach significantly outperforms [CGV14] and [BBE+18].
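Simulating a 128-bit cyclic shift with 64-bit instructions, as needed when the 1-bit table has 128 entries, can be sketched as below. The `u128` pair type and the function names are hypothetical, and this version branches on the rotation amount, so it is not constant-time; it is only meant to illustrate the word-level bookkeeping.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical 128-bit register pair */

/* Rotate a 128-bit value right by r bits using 64-bit operations only. */
static u128 rotr128(u128 v, unsigned r) {
    r &= 127u;
    if (r >= 64u) {                 /* rotating by 64 swaps the two halves */
        uint64_t t = v.lo;
        v.lo = v.hi;
        v.hi = t;
        r -= 64u;
    }
    if (r == 0u) return v;          /* avoid an undefined shift by 64 */
    u128 out;
    out.lo = (v.lo >> r) | (v.hi << (64u - r));
    out.hi = (v.hi >> r) | (v.lo << (64u - r));
    return out;
}

/* Read the 1-bit table entry at index u, 0 <= u < 128. */
static unsigned table_bit(u128 t, unsigned u) {
    uint64_t w = (u < 64u) ? t.lo : t.hi;
    return (unsigned)((w >> (u & 63u)) & 1u);
}
```

After a rotation by `r`, entry `u` of the result holds the former entry `(u + r) mod 128`, which is the table shift applied for each arithmetic share.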

Randomness usage
Due to the refresh gadgets, the practical performance of masking schemes is strongly impacted by the speed of the RNG. Table 16 shows the number of RNG calls outputting 32-bit values for each execution of the gadgets considered in the previous sections. We assume that the RNG always outputs exactly 32 bits, which means that a fresh 16-bit value also counts as one call, whereas a fresh 64-bit value counts as two.
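The counting convention can be stated as a one-line helper; the function name is hypothetical and the sketch only encodes the rule given above.

```c
/* Number of 32-bit RNG calls charged for one fresh value of the given width. */
static unsigned rng_calls(unsigned bits) {
    return (bits + 31u) / 32u;   /* round up to whole 32-bit outputs */
}
```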

Conclusion
We have described a new high-order conversion algorithm between Boolean and arithmetic masking, based on a generalization of the table recomputation countermeasure from [Cor14]. For classical k-bit to k-bit conversions, the new algorithm offers a similar level of efficiency as in [CGV14]. For 1-bit Boolean to arithmetic modulo q conversion, the new algorithm offers a similar level of efficiency as in [SPOG19]. For the computation of a threshold function from arithmetic to 1-bit Boolean masking (as used in the IND-CPA decryption), we have obtained for Kyber at least an order of magnitude improvement compared to the state of the art, thanks to a new modulus switching technique over arithmetic shares, and an optimization of the table recomputation in registers. This was confirmed by the results of a practical implementation. We think that the main advantage of our high-order table-based conversion algorithm is its flexibility: we can start from any masking as input (either Boolean, or arithmetic modulo $2^k$ or any $q$), compute any function $f$ (for example threshold or shift), and obtain any masking as output, with a good level of efficiency, and with a simpler implementation than with existing techniques.