High-order masking of NTRU

The main protection against side-channel attacks consists in computing every function with multiple shares via the masking countermeasure. While the masking countermeasure was originally developed for securing block ciphers such as AES, the protection of lattice-based cryptosystems is often more challenging, because of the diversity of the underlying algorithms. In this paper, we introduce new gadgets for the high-order masking of the NTRU cryptosystem, with security proofs in the classical ISW probing model. We then describe the first fully masked implementation of the NTRU Key Encapsulation Mechanism submitted to NIST, including the key generation. To assess the practicality of our countermeasures, we provide a concrete implementation on the ARM Cortex-M3 architecture, and finally a t-test leakage evaluation.


Introduction
Post-quantum cryptography. The RSA and ECC cryptosystems rely on the hardness of the integer factorization and the discrete logarithm problems respectively. These problems, which we can assume to be hard on a classical computer, are however vulnerable to a quantum one. Indeed, in 1995 Peter Shor designed a quantum algorithm that solves both problems in polynomial time. In light of these new threats, the National Institute of Standards and Technology (NIST) initiated in 2016 a standardization process for post-quantum cryptography that has reached its last round.
The NTRU cryptosystem. The NTRU cryptosystem was introduced in 1996 by Hoffstein, Pipher and Silverman [HPS98], covering both encryption and signature. Its security relies on the problem of finding small solutions to a system of linear equations over polynomial rings, which is assumed to remain hard even in the presence of a quantum computer. It is therefore closely related to the Shortest and Closest Vector Problems (SVP/CVP) in lattices. Although it is equivalent to neither SVP nor CVP, the NTRU cryptosystem has nonetheless resisted more than two decades of cryptanalysis. Moreover, variants of NTRU were proven secure in the (Quantum) Random Oracle Model under the Ring Learning With Errors (R-LWE) hardness assumption [SS13]. In terms of performance, NTRU is currently known as one of the fastest public-key cryptosystems, with moderate key sizes, making it a reasonable choice for embedded cryptography. Its performance earned it several standards, e.g., IEEE Std 1363.1, X9.98 and PQCRYPTO. Recently NTRU was one of the finalists of the NIST post-quantum cryptography standardization effort; the Kyber algorithm has finally been selected for standardization.
Side-channel attacks and the masking countermeasure. As for any other cryptosystem, an NTRU implementation on an embedded device is vulnerable to side-channel attacks. These attacks exploit physical leakage occurring during the execution of the algorithm to recover the key. We refer to [PPM17,HCY20,XPRO20,EMVW22] for examples of such attacks. Side-channel attacks can be prevented by using the masking countermeasure. It consists in splitting each secret variable $x$ into $n$ shares, for example $x = x_1 \oplus \cdots \oplus x_n$ with Boolean masking. Then, by processing each share independently, any leakage on at most $n - 1$ shares $x_i$ will not reveal information about the secret $x$. Formally, in this paper we consider the classical probing model introduced in [ISW03], with an attacker being able to probe any set of at most $t$ variables in the circuit. The authors showed that using at least $n = 2t + 1$ shares, one can transform any Boolean circuit $C$ into a circuit $C'$ of size $O(|C| \cdot t^2)$, such that an adversary with $t$ probes on $C'$ is no more powerful than an adversary with no probe at all. Later, finer notions of security were formalized by Barthe et al. in [BBD+16], who introduced the notions of (Strong) Non-Interference NI/SNI. This makes it possible to reach $t$-probing security with only $n = t + 1$ shares, via a composition theorem.
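As an illustration of Boolean masking, the following minimal Python sketch splits a value into $n$ shares and recombines it. The function names and the 16-bit word size are hypothetical choices made for this example; they are not taken from the implementation described in this paper, and the sketch makes no claim of probing security for anything beyond the linear XOR operation.

```python
import secrets
from functools import reduce
from operator import xor

def bool_share(x, n, k=16):
    """Split a k-bit value x into n Boolean shares with x = s_1 ^ ... ^ s_n."""
    shares = [secrets.randbits(k) for _ in range(n - 1)]
    shares.append(reduce(xor, shares, x))  # last share fixes the XOR to x
    return shares

def bool_unshare(shares):
    """Recombine Boolean shares (only done at the very end of a computation)."""
    return reduce(xor, shares, 0)

def masked_xor(xs, ys):
    """XOR is linear, so it can be computed independently on each share."""
    return [a ^ b for a, b in zip(xs, ys)]
```

Non-linear operations such as AND cannot be computed share by share in this way; they require dedicated gadgets such as the one from [ISW03].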
While any encryption scheme can be written as a Boolean circuit and then protected using the above transform, in practice that would be quite inefficient. Indeed, lattice-based cryptography usually requires performing both Boolean and arithmetic operations, and moreover, the NTRU cryptosystem combines arithmetic operations modulo $q = 2^k$ and modulo 3. It is therefore more efficient to mask some intermediate variables with arithmetic masking modulo $q$ or modulo 3, instead of with Boolean masks only. One must therefore repeatedly convert between these masked representations.
The first conversions between Boolean and arithmetic masking were described in [Gou01] for first-order security. They were then generalized to higher order in [CGV14], with complexity $O(n^2 \cdot k)$ for $n$ shares and $k$-bit words. Recently, a generic conversion algorithm was described in [CGMZ22], based on table recomputation. It allows the high-order computation of any function $f : G \to H$ between two groups $G$ and $H$, with complexity $O(|G| \cdot n^2)$. For example, by taking $G = \mathbb{Z}_3$ and $H = \mathbb{Z}_q$, one can efficiently convert from arithmetic masking modulo 3 to arithmetic masking modulo $q$, which will be useful in the context of NTRU.
Masking lattice-based public-key encryption. We review the existing masked implementations of lattice-based public-key encryption, including the NIST finalists Kyber, Saber and NTRU. To achieve IND-CCA security, the Kyber and Saber schemes use the Fujisaki-Okamoto transformation [FO99], based on the recomputation and comparison of the ciphertext during decryption. The first completely masked implementation of Kyber secure against high-order attacks was described in [BGR + 21]. For the ciphertext comparison, the masked recomputed ciphertext remains in uncompressed form, so that the compression function from Kyber need not be high-order masked. Alternative techniques for performing the ciphertext comparison have also been recently described in [CGMZ21], for both Kyber and Saber.
However, the CCA security of NTRU in the NIST submission [CDH+19] does not rely on the FO transform, but rather on the membership of the message in a specific set. This ensures the well-formedness of the ciphertext, based on the correctness of the underlying deterministic PKE scheme [BP18]. Formally, the CCA security follows from the property that for $(r, m) \in L_r \times L_m$, where $L_r$ and $L_m$ denote the plaintext sets:
$$\mathsf{NTRU.Enc}((r, m), pk) = c \iff \mathsf{NTRU.Dec}(c, sk) = (r, m)$$
Therefore, the well-formedness of $c$ is ensured by the membership test $(r, m) \overset{?}{\in} L_r \times L_m$. So far in the literature the only masked implementation of NTRU is provided by [SMS19], for security against CPA and first-order attacks only. The authors focus on protecting the polynomial product $c \cdot f \bmod q$, for the ciphertext $c$ and the private key $f$. Recently, [REB+21] introduced a generic side-channel chosen-ciphertext attack against NTRU exploiting the leakage during the membership test $(r, m) \in L_r \times L_m$. This demonstrates that a masked implementation must include the masking of this membership test.
More recently, in a concurrent work [KLRBG22], the authors described a high-order masked algorithm to perform the polynomial inversion in the key generation of NTRU, based on a conversion from additive to multiplicative masking. The authors claimed that their high-order conversion algorithm can achieve arbitrary-order security, but without a security proof. As a security evaluation, the authors used a common fixed vs. random univariate first-order Test Vector Leakage Assessment (TVLA) evaluation procedure, with 100 000 power traces. However, we show in this paper that their algorithm is actually insecure: we exhibit a third-order attack for any number of shares $n$ in the countermeasure (see Section 5.1). We then describe a repaired algorithm with a proof of security in the ISW probing model.
Our contributions. In this paper we provide the first high-order masking of the NTRU KEM finalist. More precisely, we provide a full high-order masking of both the Decapsulate algorithm (for IND-CCA decryption), and the key generation algorithm. We consider the two HPS and HRSS variants of the NTRU submission [CDH + 19]. Our countermeasures are proven secure in the classical ISW probing model, using the NI/SNI methodology.
We argue that key generation must also be protected against side-channel attacks, because in practice, the key generation procedure can be performed directly in the embedded platform, and template attacks can be quite effective against key generation. To prove the side-channel resistance of KeyGen, we use the same ISW probing model as for other operations. That is, when using n = t + 1 shares, the KeyGen algorithm should be resistant against an adversary performing a t-th order probing attack.
Our techniques are as follows. For decryption, the main challenge is to compute the reduction modulo 3 of a polynomial $a$ which is initially masked modulo $q = 2^k$. For this we proceed coefficient-wise by first converting the arithmetic sharing modulo $q$ into Boolean shares, and then converting back to arithmetic modulo 3. We also describe the high-order masking of the membership tests $r \in L_r$ and $m \in L_m$. For the latter, in the HPS version, one needs to check that the polynomial $m$ has exactly $d/2$ coefficients equal to 1, and $d/2$ coefficients equal to $-1$, for $d = q/8 - 2$. For this, we high-order compute the sum of the coefficients and check that it is equal to 0 modulo $q$, and we check that the sum of the squares of the coefficients is equal to $d$ modulo $q$.
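The two-sum membership test can be illustrated on unmasked data. For ternary coefficients, the sum equals the number of $+1$ minus the number of $-1$, and the sum of squares equals their total count, so the two conditions together force exactly $d/2$ coefficients of each sign, provided the number of coefficients is smaller than $q$. The Python sketch below is only an unmasked illustration with a hypothetical function name; the masked version operates on shares.

```python
def in_T_d(m, d, q):
    """Unmasked check that a ternary polynomial m (coefficients in {-1, 0, 1})
    has exactly d/2 coefficients equal to +1 and d/2 equal to -1.
    Assumes len(m) < q so that both sums are determined by their value mod q."""
    sum_coeffs = sum(m) % q                   # (#+1) - (#-1) modulo q
    sum_squares = sum(c * c for c in m) % q   # (#+1) + (#-1) modulo q
    return sum_coeffs == 0 and sum_squares == d % q
```

For example, with $q = 2048$ we get $d = 254$, and a polynomial with 127 coefficients equal to $+1$, 127 equal to $-1$ and the rest equal to 0 passes the test, while moving a single coefficient from $-1$ to $+1$ makes the first condition fail.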
For masking the key generation, we show how to mask the sampling of the private key, which includes the sampling of an arithmetically masked polynomial with exactly $d/2$ coefficients equal to 1 and exactly $d/2$ equal to $-1$. To do so, we start with a fixed polynomial $g_I$ with the first $d/2$ coefficients equal to 1, the next $d/2$ coefficients equal to $-1$, and the remaining coefficients equal to 0. We then compute an arithmetic sharing $g_1, \ldots, g_n$ of $g_I$. We then repeat $n$ times the following procedure: we generate a random permutation $\pi$ of the coefficients and apply $\pi$ to each share $g_i$, and then linearly refresh the shares $g_i$. Finally, we return the shared polynomial $g_1, \ldots, g_n$. We show that we indeed obtain an $n$-sharing of a random polynomial $g$ with the right distribution, and moreover an adversary with at most $n - 1$ probes learns nothing about the secret polynomial $g$.
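The permute-and-refresh procedure can be sketched as follows in Python. This is a functional illustration only: the helper names are hypothetical, the form of the linear refresh (adding a fresh random polynomial to each share and subtracting it from the last one) is an assumption about the LinearRefresh gadget, and Python's `random.SystemRandom` stands in for the device randomness source.

```python
import random

RNG = random.SystemRandom()  # OS randomness; stand-in for the device RNG

def linear_refresh(shares, q):
    """Re-randomize an arithmetic sharing modulo q without changing the secret.
    Assumed form: add a random polynomial to share i, subtract it from the last."""
    out = [list(s) for s in shares]
    for i in range(len(out) - 1):
        r = [RNG.randrange(q) for _ in range(len(out[0]))]
        out[i] = [(a + b) % q for a, b in zip(out[i], r)]
        out[-1] = [(a - b) % q for a, b in zip(out[-1], r)]
    return out

def masked_sample_g(size, d, q, n):
    """Return n arithmetic shares (mod q) of a polynomial with `size` coefficients,
    d/2 of them equal to +1 and d/2 equal to -1, in random positions."""
    g_init = [1] * (d // 2) + [q - 1] * (d // 2) + [0] * (size - d)  # public g_I
    shares = [[RNG.randrange(q) for _ in range(size)] for _ in range(n - 1)]
    shares.append([(g_init[c] - sum(s[c] for s in shares)) % q for c in range(size)])
    for _ in range(n):                      # n rounds: permute, then refresh
        perm = list(range(size))
        RNG.shuffle(perm)
        shares = [[s[perm[c]] for c in range(size)] for s in shares]
        shares = linear_refresh(shares, q)
    return shares

def recombine(shares, q):
    """Unmask the polynomial (for testing only)."""
    return [sum(col) % q for col in zip(*shares)]
```

The same permutation is applied to every share within a round, so the recombined polynomial is permuted consistently, while the refresh between rounds decorrelates the shares seen in different rounds.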
For the key generation, we also show how to high-order compute the inverse of polynomials in $\mathbb{Z}_q[X]/(\Phi_\ell)$ and $\mathbb{Z}_3[X]/(\Phi_\ell)$. In the NIST submission, these inverses are computed using the almost inverse algorithm. However, such a method would be quite challenging to mask, therefore we use exponentiation algorithms instead. More precisely, we compute the inverse of an element $x$ in $\mathbb{Z}_2[X]/\Phi_\ell$ by using the relation $x^{-1} = x^{2^{\ell-1}-2}$. Thanks to the linearity of the square in characteristic 2, such an exponentiation only requires $O(\log \ell)$ multiplications, instead of $O(\ell)$. One can then lift the inverse from modulo 2 to modulo $2^k$. Both operations are easy to high-order mask with $n$ shares, and as previously, we prove that an adversary with at most $n - 1$ probes learns nothing about the secret key. We then provide a comparison with our repaired version of the algorithm from [KLRBG22].
Finally, using the above gadgets, we describe a full high-order masking of both the Decapsulate algorithm (for IND-CCA decryption), and of the key generation algorithm. For Decapsulate, this includes the masking of the PackS3 algorithm for converting ternary polynomials into a sequence of bytes. Namely, the PackS3 algorithm is used for computing the hash $k_1 = H_1(r, m)$ when recovering the session key $k_1$, which must be output in masked Boolean form.
Implementation. In order to assess the practicality of our countermeasures, we have performed a proof of concept implementation in C of the fully masked Decapsulate and KeyGen. We have run our implementation on a laptop equipped with an Intel CPU, and also on a Cortex-M3 core mounted on an Arduino Due board. We provide the performance analysis in Section 8. The source code can be found at https://github.com/fragerar/Masked_NTRU. Finally, we have performed a leakage evaluation with a fixed vs. random t-test over 10 000 traces for one of the main gadgets, namely the reduction modulo 3 used in Decapsulate. For this, we have used the ChipWhisperer Lite board embedding a Cortex-M4 microcontroller (STM32F303) and a light oscilloscope; we provide the results in Section 8.
For any integer $x$, $x \bmod q$ will denote the positive representative of $x$, and $x \bmod^{\pm} q$ the centered one. We denote by $x \gg k$ (resp. $x \ll k$) the right (resp. left) shifting of an integer $x$ by $k$ positions.
Polynomial ring. Let $q$ be an integer; we denote by $\mathbb{Z}_q[X]$ the ring of polynomials with coefficients in $\mathbb{Z}_q$. For a prime $\ell$, we let $\Phi_1$ and $\Phi_\ell$ be the first and the $\ell$-th cyclotomic polynomials $X - 1$ and $1 + X + \cdots + X^{\ell-1}$ respectively. We recall the notations from [CDH+19]. We denote by $S/q$ the quotient ring $\mathbb{Z}_q[X]/\Phi_\ell$. A polynomial in $\mathbb{Z}[X]$ is said to be ternary if its coefficients are in $\{-1, 0, 1\}$. We denote by $T$ the set of non-zero ternary polynomials of degree at most $\ell - 2$. Equivalently, $T$ can be seen as the set of representatives of non-zero polynomials from the quotient $\mathbb{Z}_3[X]/\Phi_\ell$. For an even positive integer $d$, we also denote by $T(d)$ the subset of $T$ consisting of polynomials that have exactly $d/2$ coefficients equal to $+1$ and $d/2$ coefficients equal to $-1$. Finally, let $T_+$ denote the set of positively correlated ternary polynomials, i.e. polynomials $v \in T$ such that $\sum_i v_i \cdot v_{i+1} \ge 0$.

Definitions
We recall the definitions of (strong) non-interference security (SNI/NI) introduced in [BBD + 16]. Thanks to these definitions, a proof of security against an attacker with at most t probes can proceed in two steps: firstly one proves that every gadget satisfies the SNI definition, secondly one applies a composition theorem. The SNI definition is stronger than NI in that the number of input shares needed for the simulation only depends on the number of internal probes and not on the number of output variables to be simulated. Fortunately, the NI definition is not restrictive since composing a NI gadget with an SNI one achieves SNI security. Hence, any NI gadget can be enhanced to SNI by applying an SNI mask refreshing to its output. In this paper, we will prove that all our gadgets achieve at least NI security.
Definition 1 (t-NI security). Let $G$ be a gadget taking as input $(x_i)_{1 \le i \le n}$ and outputting the vector $(y_i)_{1 \le i \le n}$. The gadget $G$ is said to be $t$-NI secure if for any set of $t_1 \le t$ intermediate variables, there exists a subset $I$ of input indexes with $|I| \le t_1$, such that the $t_1$ intermediate variables can be perfectly simulated from $x_{|I}$.
Definition 2 (t-SNI security). Let $G$ be a gadget taking as input $n$ shares $(x_i)_{1 \le i \le n}$, and outputting $n$ shares $(z_i)_{1 \le i \le n}$. The gadget $G$ is said to be $t$-SNI secure if for any set of $t_1$ probed intermediate variables and any subset $O$ of output indexes, such that $t_1 + |O| \le t$, there exists a subset $I$ of input indexes that satisfies $|I| \le t_1$, such that the $t_1$ intermediate variables and the output variables $z_{|O}$ can be perfectly simulated from $x_{|I}$.

The NTRU cryptosystem
In this section, we recall the second round NTRU submission from [CDH+19]. It is based on a deterministic public-key encryption scheme (DPKE) described in algorithms 1, 2 and 3. The Key Encapsulation Mechanism (KEM) is depicted in algorithms 4, 5 and 6. For the two submitted versions of NTRU, namely NTRU-HPS and NTRU-HRSS, we recall in Table 1 the definition of the sets of integer polynomials $L_f$, $L_g$, $L_r$, $L_m$, and the embedding Lift. For simplicity, the algorithms are described according to the NTRU-HPS version, for which $\mathsf{Lift}(m) = m$. We also recall in Table 2 the values of the parameter $\ell$ and modulus $q$ for the four versions of NTRU.
Alg. 1 KeyGen(seed)

Alg. 5 Encapsulate(h)
  1: coins ← {0, 1}^256
  2: (r, m) ← Sample_rm(coins)
  3: c ← Encrypt(h, (r, m))
  4: k ← H_1(r, m)
  5: return (c, k)

The NTRU DPKE scheme. We briefly explain why the DPKE scheme works (alg. 1, 2, 3). Since $\ell$ is a prime, and 2 is of order $\ell - 1$ in $\mathbb{Z}_\ell^*$, we get that $\Phi_\ell$ is an irreducible polynomial modulo 2. We deduce that the set of polynomials modulo 2 and $\Phi_\ell$ is a field, and therefore $f \in L_f$ is invertible modulo 2 and $\Phi_\ell$. One can then lift the inverse from modulo 2 to modulo $q$ and $\Phi_\ell$. The same holds for the inverse of $f$ modulo 3. Note that from $g \in L_g$, we have $g = 0 \pmod{q, \Phi_1}$, and therefore $h = 0 \pmod{q, \Phi_1}$.
The encryption of $m$ is given by:
$$c = r \cdot h + m \pmod{q, \Phi_1 \Phi_\ell}$$
From Line 2 of Algorithm 3, we have:
$$a = c \cdot f = r \cdot h \cdot f + m \cdot f \pmod{q, \Phi_1 \Phi_\ell}$$
By definition we have $h \cdot f = 3 \cdot g \pmod{q, \Phi_\ell}$. This gives $a = 3 \cdot g \cdot r + m \cdot f \pmod{q, \Phi_\ell}$. Moreover, from $m = 0 \pmod{\Phi_1}$, we have $c = 0 \pmod{q, \Phi_1}$, and therefore $a = 0 \pmod{q, \Phi_1}$. Besides we have $g = m = 0 \pmod{q, \Phi_1}$, therefore we deduce:
$$a = 3 \cdot g \cdot r + m \cdot f \pmod{q, \Phi_1 \Phi_\ell}$$
One can show that the equation also holds over $\mathbb{Z}$, not only modulo $q$. Namely, the polynomials $g$, $r$, $m$ and $f$ have small coefficients, therefore the equality holds over $\mathbb{Z}$ when we represent the polynomials modulo $q$ with coefficients between $-q/2$ and $q/2$. This gives:
$$a = 3 \cdot g \cdot r + m \cdot f \pmod{\Phi_1 \Phi_\ell}$$
We deduce that $a = m \cdot f \pmod{3, \Phi_1 \Phi_\ell}$, and therefore $m \equiv a \cdot f_p \pmod{3, \Phi_\ell}$. Since $\deg m \le \ell - 2$ and $m$ is ternary, we must have $m = a \cdot f_p \pmod{3, \Phi_\ell}$, as computed in Line 3 of Algorithm 3. Finally, we have:
$$c - m = r \cdot h \pmod{q, \Phi_\ell}$$
and since $\deg(r) \le \ell - 2$, we can recover $r$ at Line 4 with $r = (c - m) \cdot h_q \bmod (q, \Phi_\ell)$.
CCA security of NTRU. The CCA security of NTRU is a consequence of its rigidity. The rigidity is expressed as follows, for $(r, m) \in L_r \times L_m$:
$$\mathsf{NTRU.Enc}((r, m), pk) = c \iff \mathsf{NTRU.Dec}(c, sk) = (r, m)$$
Therefore, the FO transformation can be avoided by using the membership check $(r, m) \in L_r \times L_m$ since it ensures a correct ciphertext recomputation. Finally, the rigidity is ensured by the choice of parameters in Table 1, see [BP18,HRSS17].
The NTRU KEM. The KEM version of NTRU proceeds similarly to the NTRU DPKE scheme (alg. 4, 5, 6). It adds a seed $s$ to the secret key. This seed is used for implicit rejection during the decapsulation in order to preserve CCA security [BP18]. The Encapsulate algorithm samples $r$ and $m$ from their respective sets and encrypts them into $c$. Then it hashes $(r, m)$ into the session key $k$. Finally, the Decapsulate algorithm decrypts the ciphertext $c$ into $(r', m', \mathit{fail})$. When no decryption failure occurs, the rigidity of the NTRU DPKE scheme ensures that $r'$ and $m'$ match the original $r$ and $m$ from encryption, which makes it possible to recover the session key $k$.

New gadgets for high-order masking NTRU
In this section, we describe the high-order masking of the main components of the NTRU cryptosystem. We recall in Appendix A the main masking tools, such as arithmetic vs Boolean conversions, and zero-testing with Boolean or arithmetic shares.

Decryption: masking the reduction modulo 3
The polynomial $a$ at Step 2 of Decrypt (Algorithm 3) is arithmetically masked modulo $q$, because the secret key $f$ is arithmetically masked modulo $q$. Namely, given as input the ciphertext $c$ and the masked secret key $f = f_1 + \cdots + f_n \pmod q$, we obtain:
$$a = c \cdot f = c \cdot f_1 + \cdots + c \cdot f_n \pmod{q, \Phi_1 \Phi_\ell}$$
and letting $a_i = c \cdot f_i \bmod (q, \Phi_1 \Phi_\ell)$, we obtain $a = a_1 + \cdots + a_n \pmod q$ as required.
The main difficulty is then to compute the polynomial $a$ modulo $(3, \Phi_\ell)$, which corresponds to Step 3 of Decrypt. Namely the polynomial $a$ satisfies
$$a = 3 \cdot g \cdot r + m \cdot f \pmod{q, \Phi_1 \Phi_\ell}$$
where the polynomials $g$, $r$, $m$ and $f$ have small coefficients, and therefore the equality holds over the integers (not only modulo $q$). This makes it possible to get rid of the $3 \cdot g \cdot r$ part by reduction modulo 3. One must therefore perform this operation while the polynomial $a$ is arithmetically masked modulo $q$. Note that we cannot directly reduce each share $a_i$ modulo 3 when $a$ is arithmetically masked modulo $q$, as the reduction modulo 3 is not linear over the ring $\mathbb{Z}_q$. Consider for example a masking with two shares $x_1$ and $x_2$ with $q = 256$, and let $x = x_1 + x_2 \bmod 256$, with $x_1 = 222$ and $x_2 = 57$, which gives $x = 23$. If we reduce $x_1$ and $x_2$ directly modulo 3, we obtain $(222 \bmod 3) + (57 \bmod 3) = 0 \pmod 3$, but on the other hand we have $x \bmod 3 = 2$. So reducing the shares modulo 3 directly would give an incorrect result, and a more complex technique is required.
For this, the idea is to first convert each coefficient of $a$ from arithmetic masking modulo $q$ into Boolean masking, and then perform a conversion from Boolean masking to arithmetic masking modulo 3. More precisely, we write $q = 2^k$ and we consider a coefficient $-2^{k-1} \le x < 2^{k-1}$. We write $x = 3 \cdot u + v$ with $0 \le v < 3$. Given as input an arithmetic sharing of $x$ modulo $2^k$, we must output an arithmetic sharing of $v$ modulo 3. We write $x^{(j)}$ for the $j$-th bit of $x \bmod 2^k$, so we can write:
$$x = -x^{(k-1)} \cdot 2^{k-1} + \sum_{j=0}^{k-2} x^{(j)} \cdot 2^j$$
and therefore, since $2^j \equiv (-1)^j \pmod 3$, we obtain the value of $v = x \bmod 3$ as a function of the bits $x^{(j)}$:
$$v = (-1)^k \cdot x^{(k-1)} + \sum_{j=0}^{k-2} (-1)^j \cdot x^{(j)} \pmod 3 \tag{3}$$
We now explain how to high-order compute $v$ modulo 3 from an arithmetic masking of $x$ modulo $q = 2^k$. Taking $x = x_1 + \cdots + x_n \pmod q$ as input, we first perform an arithmetic to Boolean masking conversion, so we obtain $x = y_1 \oplus \cdots \oplus y_n$ with $y_i \in \{0, 1\}^k$ for all $1 \le i \le n$. Letting $y_i^{(j)}$ be the $j$-th bit of $y_i$, we have $x^{(j)} = y_1^{(j)} \oplus \cdots \oplus y_n^{(j)}$ for all $0 \le j < k$. Therefore we perform a Boolean to arithmetic modulo 3 conversion of each $x^{(j)}$, which gives for all $0 \le j < k$:
$$x^{(j)} = z_1^{(j)} + \cdots + z_n^{(j)} \pmod 3 \tag{4}$$
Finally, we obtain by combining (3) and (4):
$$v = \sum_{i=1}^{n} w_i \pmod 3, \qquad w_i = (-1)^k \cdot z_i^{(k-1)} + \sum_{j=0}^{k-2} (-1)^j \cdot z_i^{(j)} \bmod 3$$
which gives an $n$-sharing of $v$ modulo 3, as required. We provide the corresponding algorithm below. We refer to Appendix A.1 for an overview of the conversion algorithms $\mathsf{AtoB}_{2^k}$ and $\mathsf{BtoA}_3$, which are assumed to satisfy the SNI property. Note that our algorithm can work for any modulus $q$, not only $2^k$, by using an algorithm for converting from arithmetic modulo $q$ to Boolean masking at Line 1.
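The bit-level reduction modulo 3 can be checked with a short Python sketch. It relies on $2^j \equiv (-1)^j \pmod 3$ and on reading the centered coefficient in two's complement, so that the top bit carries weight $-2^{k-1}$. All function names are hypothetical, and `share_mod3` is only a functional stand-in for the $\mathsf{BtoA}_3$ conversion: it takes the bit in the clear and is therefore not a secure gadget.

```python
import secrets

def mod3_from_bits(x_mod_q, k):
    """x_mod_q in [0, 2^k) encodes a centered x in [-2^(k-1), 2^(k-1)).
    Return x mod 3 using only the bits of x_mod_q."""
    bits = [(x_mod_q >> j) & 1 for j in range(k)]
    v = sum((-1) ** j * bits[j] for j in range(k - 1))
    v -= (-1) ** (k - 1) * bits[k - 1]     # top bit has weight -2^(k-1)
    return v % 3

def share_mod3(bit, n):
    """Stand-in for BtoA_3: an n-sharing modulo 3 of a bit (NOT a secure gadget)."""
    s = [secrets.randbelow(3) for _ in range(n - 1)]
    s.append((bit - sum(s)) % 3)
    return s

def recombine_mod3(z, k):
    """z[j] is an n-sharing mod 3 of bit j; apply the signed sum on each share."""
    signs = [(-1) ** j for j in range(k - 1)] + [-((-1) ** (k - 1))]
    n = len(z[0])
    return [sum(signs[j] * z[j][i] for j in range(k)) % 3 for i in range(n)]

# Reducing arithmetic shares modulo 3 directly is incorrect (q = 256):
naive = (222 % 3 + 57 % 3) % 3            # share-wise reduction
correct = ((222 + 57) % 256) % 3          # reduction of the unmasked value
```

Because the signed sum is linear modulo 3, it can be applied to each share index independently, which is what makes the final recombination step cheap once the bits are shared modulo 3.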
Algorithm 7 Mod3Red [steps 1 to 4 and the steps after 6 are missing from the source]
  5: end for
  6: for i = 1 to n do

Security. The following theorem shows that the Mod3Red algorithm achieves the $t$-SNI security notion.
Theorem 1 (t-SNI security of Mod3Red). For any subset $O \subset [1, n]$ and any $t_1$ intermediate variables with $|O| + t_1 \le t$, the output variables $w_{|O}$ and the $t_1$ intermediate variables can be perfectly simulated from the input variables $x_{|I}$, with $|I| \le t_1$.
Proof. The $t$-SNI property of the part from lines 2 to 9 follows from the $t$-SNI property of each of the $k$ independent $\mathsf{BtoA}_3$ conversions. Namely the corresponding output shares $z_i^{(j)}$ are combined independently for each share index $1 \le i \le n$. Therefore we can use the same output subset $O$ for each intermediate output sharing $(z_i^{(j)})_{1 \le i \le n}$ for $0 \le j < k$. The $t$-SNI property of the complete algorithm follows from the composition of two SNI gadgets.
Complexity. We assume that a group operation as well as randomness generation takes unit time. The complexity of Algorithm 7 is therefore that of one $\mathsf{AtoB}_{2^k}$ conversion and $k$ $\mathsf{BtoA}_3$ conversions, plus $O(n \cdot k)$ additions modulo 3 for the final recombination.

Key generation: masked generation of g ← L_g
In this section, we explain how to generate an arithmetically masked $g \leftarrow L_g$, which corresponds to Line 1 of the KeyGen algorithm (Alg. 1). We consider only the HPS version, for which $L_g = T(q/8 - 2)$, see Table 1. We will consider the HRSS version in Section 7.2. Obviously, we cannot simply generate an unmasked $g \leftarrow L_g$ and later arithmetically mask it with $n$ shares, as the attacker could directly probe the unmasked $g$. Therefore, the key generation algorithm must be masked with $n$ shares from the beginning.
Recall that $T(q/8 - 2)$ is the set of ternary polynomials of degree at most $\ell - 2$ containing exactly $q/16 - 1$ coefficients equal to 1, and $q/16 - 1$ coefficients equal to $-1$. In the NIST submission [CDH+19], the authors apply a random permutation to the coefficients of an initially fixed polynomial $g_I$ with its first $q/16 - 1$ coefficients equal to 1, its $q/16 - 1$ following coefficients equal to $-1$, and its remaining coefficients equal to 0. Actually, the applied permutation is not perfectly random. Namely, in the corresponding FixedType algorithm from [CDH+19], given a $30(\ell - 1)$-bit seed, the permutation is obtained by concatenating to each coefficient a 30-bit prefix, then sorting the list of 32-bit entries, and finally discarding the 30-bit prefix to keep the permuted coefficients. Obviously, such a procedure would be quite challenging to mask directly.
Alternatively, we use the following simple approach, which also provides a perfectly random permutation. We start with the initial polynomial $g = g_I$ as previously, and we encode $g$ over $n = t + 1$ shares with arithmetic masking modulo $q$, for security against $t$ probes. We then repeat the following procedure $n = t + 1$ times: we randomly permute the $\ell - 1$ coefficients of $g$ by generating an independent random permutation $\pi$; for this, we actually apply $\pi$ on each share of $g$; we then perform a linear mask refreshing modulo $q$ of each coefficient of $g$. Finally, we output the arithmetically masked polynomial $g$ modulo $q$. We describe the algorithm below. We denote by $P_{\ell-1}$ the set of permutations of $\{0, \ldots, \ell - 2\}$. We assume that we have an efficient algorithm for generating a permutation $\pi \leftarrow P_{\ell-1}$ uniformly at random. We recall the LinearRefresh algorithm in Appendix A.5, applied on the quotient ring $S/q = \mathbb{Z}_q[X]/\Phi_\ell$.
Security. The above algorithm is secure against an adversary with at most $t = n - 1$ probes, because by definition, at least one of the $n$ permutations and subsequent linear mask refreshing has not been probed, after which the adversary's probes can be perfectly simulated without knowing the secret key. This is the same security argument as for proving the security of the table recomputation countermeasure [Cor14]. Formally, the following theorem proves the security of the above algorithm. For a key generation algorithm, there are no inputs, so we need to prove that for any generated secret key $g$, any set of $t < n$ probes can be perfectly simulated without knowing $g$.
Proof. We consider any fixed secret $g \in T(d)$, and we consider a secret $\pi \leftarrow P_{\ell-1}$ such that $g = \pi(g_I)$. We denote by $\mathsf{Part}_j$ for $1 \le j \le n$ the execution steps of the algorithm during the for loop from Line 2 to Line 6. Since there are $t_1 < n$ probed variables, at least one execution of the for loop has not been probed. Let $j^\star$ be the corresponding index, such that $\mathsf{Part}_{j^\star}$ has not been probed.
We split the probed variables into 2 sets: $S_{<j^\star}$ and $S_{>j^\star}$, which correspond to the variables probed during the execution of $\mathsf{Part}_j$ for $j < j^\star$ and $j > j^\star$ respectively. The variables from $S_{<j^\star}$ can be perfectly simulated without the knowledge of $g$. Indeed, for each index $j < j^\star$, it suffices to draw $\pi_j \leftarrow P_{\ell-1}$ uniformly at random, and simulate all variables from the initial sharing of $g_I$ at Step 1 and $\pi_j$.
In order to simulate the variables from $S_{>j^\star}$, we define a set of indexes $I$ such that $i \in I$ iff a variable $g_i$ has been probed. By construction we have $|I| \le t_1 < n$. Since $\mathsf{Part}_{j^\star}$ has not been probed, the corresponding LinearRefresh gadget has not been probed, hence any subset of at most $n - 1$ output shares is uniformly and independently distributed; hence the corresponding outputs $g_{|I}$ can be perfectly simulated. One can then propagate the simulation for the $\mathsf{Part}_j$ processes for $j > j^\star$, and simulate any variable from the set $S_{>j^\star}$ from such $g_{|I}$; as previously we generate the permutations $\pi_j$ for $j > j^\star$ uniformly at random in $P_{\ell-1}$.
Finally, for consistency we must have $\pi = \pi_n \circ \cdots \circ \pi_{j^\star} \circ \cdots \circ \pi_1$, which is possible by fixing the permutation $\pi_{j^\star}$ satisfying this equation. The knowledge of $\pi_{j^\star}$ is not required for the simulation, since by assumption $\mathsf{Part}_{j^\star}$ has not been probed. Hence the simulation can be performed without the knowledge of $\pi$ and the output secret key $g$.

Complexity. The time complexity of the algorithm is
Remark 1. Note that our security model assumes that the adversary can only probe at most n − 1 of the n permutations, so in the security proof at least one permutation can be treated as a black-box. However, for security against real side-channel leakages, it may be difficult to implement a permutation so that this assumption is satisfied in practice. More precisely, it may be possible to perform a template attack against the permutations, so that using a single trace, the adversary could recover all n permutations and eventually the secret-key. We refer to [KAA21] for an example of such attack.

Key generation: high-order computation of 1/f modulo q
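In this section, we show how to high-order compute the inverse of the secret polynomial $f$ modulo $(q, \Phi_\ell)$. As recalled above, this inverse is obtained by lifting the inverse modulo 2. Therefore, we first recall how to compute inverses in $S/2 = \mathbb{Z}_2[X]/\Phi_\ell$.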
Computing inverse over S/2. Since $\Phi_\ell(x)$ is irreducible modulo 2, the multiplicative group of $S/2 = \mathbb{Z}[X]/(2, \Phi_\ell)$ has order $2^{\ell-1} - 1$. Therefore, we can first compute the inversion of $f$ in $\mathbb{Z}[X]/(2, \Phi_\ell)$, using a sequence of squares and multiplies as in [IT88], and then lift the result modulo $q$. Namely, such an exponentiation approach is much easier to mask than the extended-gcd approach. More precisely, we must compute:
$$f^{-1} = f^{2^{\ell-1}-2} \pmod{2, \Phi_\ell} \tag{5}$$
To compute this exponentiation, we use the identity $2^{a+b} - 1 = 2^a \cdot (2^b - 1) + (2^a - 1)$, which gives:
$$x^{2^{a+b}-1} = \left(x^{2^b-1}\right)^{2^a} \cdot x^{2^a-1} \tag{6}$$
where the exponentiation by $2^a$ is a linear operation. In particular, we obtain:
$$x^{2^{2a}-1} = \left(x^{2^a-1}\right)^{2^a} \cdot x^{2^a-1}, \qquad x^{2^{a+1}-1} = \left(x^{2^a-1}\right)^{2} \cdot x$$
which implies that we can perform the equivalent of a square-and-multiply. We provide the corresponding FastExpo algorithm below, with the proof of correctness (Theorem 3) in Appendix B.1.

Algorithm 9 FastExpo [steps 1 to 4 are missing from the source]
  5: if m_i = 1 then y ← y^2 × x
  6: end for
  7: return y

Theorem 3 (Correctness). Given as input $x \in \mathbb{Z}_2[X]/\Phi_\ell$, Algorithm 9 outputs $x^{2^m-1}$ in $\lfloor \log_2 m \rfloor + H_w(m) - 1 \le 2 \log_2(m)$ non-linear multiplications, where $H_w(m)$ is the Hamming weight of $m$.
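The exponentiation strategy can be sketched in Python as follows, under the assumption that the exponent $m$ is scanned from its most significant bit, with one "doubling" multiplication per bit and one extra multiplication for each bit equal to 1. Polynomials over $\mathbb{Z}_2$ are encoded as Python integers (bit $i$ is the coefficient of $X^i$); the function names and this encoding are choices made for the example, not taken from the paper's implementation.

```python
def gf2_mulmod(a, b, ell):
    """Multiply a and b in Z_2[X]/Phi_ell, with Phi_ell = 1 + X + ... + X^(ell-1)."""
    phi = (1 << ell) - 1
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        b >>= 1
        a <<= 1
        if (a >> (ell - 1)) & 1:   # degree ell-1 reached: subtract Phi_ell
            a ^= phi
    return prod

def fast_expo(x, m, ell):
    """Return (x^(2^m - 1), number of non-linear multiplications)."""
    def repeated_square(y, times):
        for _ in range(times):     # squaring is linear in characteristic 2
            y = gf2_mulmod(y, y, ell)
        return y
    y, a, mults = x, 1, 0          # invariant: y = x^(2^a - 1)
    for bit in bin(m)[3:]:         # bits of m after the leading 1
        y = gf2_mulmod(repeated_square(y, a), y, ell)   # a -> 2a
        a, mults = 2 * a, mults + 1
        if bit == "1":
            y = gf2_mulmod(repeated_square(y, 1), x, ell)  # a -> a + 1
            a, mults = a + 1, mults + 1
    return y, mults

def inverse_mod2(x, ell):
    """x^(-1) = (x^(2^(ell-2) - 1))^2 when Phi_ell is irreducible modulo 2."""
    y, _ = fast_expo(x, ell - 2, ell)
    return gf2_mulmod(y, y, ell)
```

With the toy parameter $\ell = 11$ (for which 2 has order 10 modulo 11), the exponent is $m = 9$ and the sketch uses $3 + 2 - 1 = 4$ multiplications, in line with the count stated in Theorem 3. Squarings are implemented here with the generic multiplication for brevity; they are not counted because they are linear.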
Computing inverse over $S/q = \mathbb{Z}_q[X]/\Phi_\ell$. We now recall how to compute inverses over $S/q = \mathbb{Z}_q[X]/\Phi_\ell$. For this we recall the unmasked SqInverse algorithm from [CDH+19], which lifts the inverse modulo 2 into an inverse modulo $2^{2^i}$ at each step $i$ of the while loop, until $2^{2^i} \ge q$. We provide the proof of correctness in Appendix B.2.
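The lifting step can be illustrated with the standard Newton iteration $v \leftarrow v \cdot (2 - f \cdot v)$: if $f \cdot v \equiv 1$ modulo $2^e$, then after one step $f \cdot v \equiv 1$ modulo $2^{2e}$. The Python sketch below assumes that SqInverse has this form, which is not confirmed by the text above; the function names, the list encoding of polynomials and the brute-force inverse modulo 2 (usable only for toy sizes) are hypothetical.

```python
from itertools import product

def polymul(a, b, q, ell):
    """Multiply in Z_q[X]/Phi_ell; polynomials are lists of ell-1 coefficients."""
    c = [0] * ell
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[(i + j) % ell] += ai * bj            # reduce modulo X^ell - 1 first
    # X^(ell-1) = -(1 + X + ... + X^(ell-2)) modulo Phi_ell
    return [(ci - c[ell - 1]) % q for ci in c[:ell - 1]]

def inverse_mod2_bruteforce(f, ell):
    """Toy inverse modulo (2, Phi_ell) by exhaustive search (small ell only)."""
    one = [1] + [0] * (ell - 2)
    for cand in product([0, 1], repeat=ell - 1):
        if polymul(f, list(cand), 2, ell) == one:
            return list(cand)
    raise ValueError("f is not invertible modulo (2, Phi_ell)")

def lift_inverse(f, v, q, ell):
    """Lift v = f^-1 mod (2, Phi_ell) to f^-1 mod (q, Phi_ell), for q = 2^k."""
    precision = 1                                   # f*v = 1 modulo 2^precision
    while (1 << precision) < q:
        fv = polymul(f, v, q, ell)
        two_minus_fv = [(-x) % q for x in fv]
        two_minus_fv[0] = (two_minus_fv[0] + 2) % q
        v = polymul(v, two_minus_fv, q, ell)        # Newton step
        precision *= 2
    return v
```

Since the precision doubles at each step, about $\log_2 \log_2 q$ iterations suffice, each costing two multiplications in $S/q$.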
High-order masking. The two previous algorithms are easy to mask. Namely, for the FastExpo algorithm, it suffices to high-order mask the polynomial multiplications at lines 4 and 5. This can be done via a SecMult algorithm, as a straightforward extension of the And gadget from [ISW03]. We provide in Appendix B.3 the high-order masking of the FastExpo algorithm, called SecFastExpo. Similarly, we provide in Appendix B.4 an algorithmic description of the high-order masked version of Algorithm 10 above, called SecSqInverse. Note that after Line 2 of Algorithm 10, the polynomial $v$ must be considered modulo $q$ instead of modulo 2, so we consider each share of $v$ as a share modulo $q$. The final complexity of our polynomial inversion algorithm in $S/q = \mathbb{Z}_q[X]/\Phi_\ell$ is $O(n^2 \cdot (\log \ell + \log \log q))$ operations in $S/q$. We provide the proof of the following theorem in Appendix B.4.
Addition chains. More generally, to compute the exponentiation given by (5), from (6) it suffices to provide an addition chain for the integer $\ell - 2$. The number of additions in the chain gives the number of multiplications in $\mathbb{Z}[X]/(2, \Phi_\ell)$. From the square-and-multiply algorithm above, there always exists an addition chain for $m = \ell - 2$ with $\lfloor \log_2 m \rfloor + H_w(m) - 1 \le 2 \log_2(m)$ additions. However, one can often find better addition chains. For example, in [HRSS17], the authors compute the inversion in $\mathbb{F}_{2^{700}}$ with 12 multiplications only (instead of 15 with the square-and-multiply). We refer to Appendix B.6 for more details.

The polynomial inversion algorithm from [KLRBG22]
Recently, the authors of [KLRBG22] described a high-order masked algorithm to perform the polynomial inversion in the key generation of NTRU, based on a conversion from arithmetic to multiplicative masking. The authors claimed that their high-order conversion algorithm can achieve arbitrary-order security, but without a security proof. Below, we show that their algorithm is actually insecure: we exhibit a third-order attack for any number of shares $n$ in the countermeasure. We then describe a simple repair with a proof of security, and we finally provide a comparison between our high-order inversion algorithm from Section 4.3 and the repaired algorithm.

Our third-order attack
Let R be a ring. The technique used in [KLRBG22] to high-order compute the inverse of an element a ∈ R is to use a multiplicative masking a = n i=1 m i with invertible elements m i ∈ R , so that the inversion in R becomes a linear operation in the number n of masks (instead of quadratic for additive masking): We recall in Algorithm 11 below the arithmetic to multiplicative masking conversion algorithm from [KLRBG22,Alg. 4].
Algorithm 11 Additive to multiplicative conversion (A2M)
Input: An arithmetic masking $a = a_1 + \cdots + a_n \in R$
Output: A multiplicative masking $a = \prod_{i=1}^{n} m_i \in R$
1: for i = n downto 2 do
7:   $a_{i-1} \leftarrow a_{i-1} + a_i$
8: end for
9: $m_1 \leftarrow a_1$
10: return $m_1, \ldots, m_n$

Our attack. We describe a third-order attack that works for any number of shares n. We probe the initial value $a_1$, the value $a'_1$ of the variable $a_1$ for the last index i = 2 after Line 5, and the output variable $m_1$. Since for each n ≥ i ≥ 2 the random $r_i$ is multiplicatively accumulated on the variable $a_1$, we obtain:
$$a'_1 = a_1 \cdot \prod_{i=2}^{n} r_i, \qquad m_1 = a \cdot \prod_{i=2}^{n} r_i, \qquad \text{and therefore} \qquad m_1 \cdot a_1 = a \cdot a'_1,$$
which shows that the secret value a can always be recovered from the 3 probes $a_1$, $a'_1$ and $m_1$. This shows that for any number of shares n, the countermeasure can provide at most second-order security.
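The three-probe attack can be reproduced on a toy instance. The Python sketch below is hedged in several ways: it works in the prime field Z_251 instead of the NTRU ring, and the loop body is our assumed reading of the conversion (at iteration i, draw an invertible $r_i$, set $m_i = r_i^{-1}$, multiply the remaining shares by $r_i$, then fold $a_i$ into $a_{i-1}$). All function names are hypothetical.

```python
import random

P = 251  # toy prime; the field Z_P stands in for the ring R (assumption)


def inv(x: int) -> int:
    """Inverse of a non-zero x modulo the prime P (Fermat's little theorem)."""
    return pow(x, P - 2, P)


def a2m_with_probe(shares, rng):
    """Assumed reading of A2M. Returns (m, a1_probe), where a1_probe is the
    value of a_1 in the last iteration, just before the final fold."""
    a = list(shares)
    n = len(a)
    m = [0] * n
    a1_probe = None
    for i in range(n - 1, 0, -1):         # paper index i = n downto 2
        r = rng.randrange(1, P)           # fresh invertible r_i
        m[i] = inv(r)                     # m_i = r_i^{-1}
        for j in range(i + 1):            # scale the remaining shares by r_i
            a[j] = a[j] * r % P
        if i == 1:                        # last iteration (paper index i = 2)
            a1_probe = a[0]
        a[i - 1] = (a[i - 1] + a[i]) % P  # fold share a_i into a_{i-1}
    m[0] = a[0]
    return m, a1_probe


def recover_secret(a1_initial, a1_probe, m1):
    """Combine the three probes; requires a_1 to be invertible."""
    return m1 * a1_initial * inv(a1_probe) % P


rng = random.Random(2023)
secret, n = 123, 6
shares = [rng.randrange(1, P)] + [rng.randrange(P) for _ in range(n - 2)]
shares.append((secret - sum(shares)) % P)
m, probe = a2m_with_probe(shares, rng)
print(recover_secret(shares[0], probe, m[0]) == secret)
```

The recovery does not depend on n: only the first share, its final scaled value and the output $m_1$ are used.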
In [KLRBG22, Alg. 6] the authors also described an optimization of their algorithm, which converts the additive shares $a = a_1 + \cdots + a_n$ into multiplicative shares of the inverse of a, namely $a^{-1} = m_1 \times \cdots \times m_n$, using a single inversion instead of n − 1. Our third-order attack also applies to this variant. In the following, we focus on this variant since it is more efficient (it requires a single inversion in R instead of n − 1 inversions). More precisely, we provide a repair of this latter algorithm, with a proof of security in the ISW probing model.

Repaired polynomial inversion algorithm
Additive to multiplicative conversion. In this section, we describe the repaired high-order polynomial inversion algorithm, starting from the additive to multiplicative conversion algorithm described in [KLRBG22, Alg. 6], which requires a single polynomial inversion only. To repair this algorithm, it suffices to add a mask refreshing at each iteration of the for loop, and to delay the recombination of the shares to the end of the algorithm. We provide the pseudo-code of the A2M_INV algorithm below; we refer to Appendix A.5 for the LinearRefresh algorithm. This corrected version is actually similar to the zero-test algorithm in [CGMZ21, Algorithm 3], which is also based on an additive to multiplicative masking conversion. The time complexity of the modified algorithm is

Algorithm 12 Additive to multiplicative conversion (A2M_INV)
Input: $a = a_1 + \cdots + a_n$
Output: $a^{-1} = m_1 \cdots m_n$
1: for i = 1 to n do
2:   for j = 1 to n do $a_j \leftarrow r_i \cdot a_j$
4:   $a_1, \ldots, a_n \leftarrow \mathrm{LinearRefresh}_R(a_1, \ldots, a_n)$

Proof. We denote by $\mathrm{Part}_i$ for 1 ≤ i ≤ n the steps of the algorithm from Line 1 to Line 6 in the for loop with index i, and by $a_j^{(i)}$ the value of the share $a_j$ at the end of $\mathrm{Part}_i$. Let P = {i | $\mathrm{Part}_i$ has been probed or i ∈ O}. From $t_1 + |O| < n$ we deduce $P \subsetneq [1, n]$, and therefore there exists $i^*$ such that $\mathrm{Part}_{i^*}$ has not been probed and $i^* \notin O$. We construct a subset I ⊂ [1, n] of input indexes for the simulation. We start with an empty I and for each probed variable $a_j$ we add j to the set. By construction we must have $|I| \le t_1$.

Every probed variable in $\mathrm{Part}_i$ for $i < i^*$ can be perfectly simulated from $a_{|I}$. It remains to simulate the variables probed at $\mathrm{Part}_i$ for $i > i^*$. Since by assumption $m_{i^*}$ and $r_{i^*}$ have not been probed and $i^* \notin O$, the random $r_{i^*}$ acts as a one-time pad for the value $a^{(i^*)} = a_1^{(i^*)} + \cdots + a_n^{(i^*)}$. Moreover we note that $a^{(i^*)} = a \cdot m_1 \cdots m_{i^*}$ is invertible as the product of invertible elements. Therefore, $a^{(i^*)}$ is uniformly distributed among the invertible elements of R. Since $\mathrm{Part}_{i^*}$ has not been probed, the corresponding LinearRefresh instance has not been probed. We can therefore perfectly simulate all shares $a_j^{(i^*)}$ at the end of $\mathrm{Part}_{i^*}$ with fresh random values whose sum is invertible. This simulation can subsequently be propagated to all $a_j$ variables until the end of the algorithm. We therefore conclude that Algorithm 12 is (n−1)-SNI.
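The LinearRefresh step used in the repaired conversion is recalled as Algorithm 19 in Appendix A.5. A minimal Python sketch, assuming the additive group is Z_q and using the standard `secrets` module as a stand-in for the device RNG, could look as follows; the function name is illustrative.

```python
import secrets


def linear_refresh(x: list[int], q: int) -> list[int]:
    """Refresh an additive sharing modulo q with n - 1 fresh random values.

    The output shares sum to the same value as the input shares modulo q.
    """
    n = len(x)
    y = list(x)                            # y_n starts as x_n
    for j in range(n - 1):
        r = secrets.randbelow(q)           # fresh random element of Z_q
        y[j] = (x[j] + r) % q              # y_j <- x_j + r_j
        y[n - 1] = (y[n - 1] - r) % q      # y_n <- y_n - r_j
    return y


shares = [5, 2040, 17, 1000]
refreshed = linear_refresh(shares, 2048)
print(sum(refreshed) % 2048 == sum(shares) % 2048)  # True
```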
Multiplicative to additive masking conversion. In [KLRBG22], the authors also provide a multiplicative to additive masking conversion algorithm, without a security proof. In the following, we recall their algorithm, and prove that it achieves the t-SNI security property. We refer to Appendix E for the proof. The complexity of the algorithm is $T_{\mathrm{M2A}}(n) \sim 2n^2$.

Comparison
The repaired polynomial inversion algorithm from [KLRBG22] is asymptotically faster than our algorithm from Section 4.3, since for inversion in S/q = Z q [X]/Φ , its complexity is O(n 2 +log ) operations in S/q, instead of O(n 2 · (log + log log q)). This is confirmed experimentally in tables 3 and 4 below, in which we compare the cycle count and randomness consumption of the two polynomial inversion algorithms.

High-order masking of NTRU decryption

In the previous sections, we have considered the masking of some specific components of NTRU. In this section, we consider the full high-order masking of the NTRU IND-CCA decryption, more precisely the Decapsulate algorithm (Alg. 6). We first recall the NTRU Decrypt and Decapsulate algorithms, already described in Section 3. The Decrypt algorithm takes as input the ciphertext c and returns (r, m) if the ciphertext c is well formed (fail = 0), otherwise it returns fail = 1. If the ciphertext is well formed, the Decapsulate algorithm returns the session key $k_1 = H_1(r, m)$, otherwise it returns the dummy key $k_2$. We summarize below the high-order masking of the Decrypt and Decapsulate operations.

At Step 1 of Decrypt, the input ciphertext is unmasked, so we can perform the test c = 0 mod (q, $\Phi_1$) in the clear.
reduction modulo 3. This has been described in Section 4.1. After high-order multiplication by f p , which is arithmetically masked modulo 3, we eventually obtain the masked message m modulo 3.

At Step 4 of Decrypt, we must first convert m from arithmetic masking modulo 3 to masking modulo q. See Appendix A.2 for a description of the technique. We can then obtain an arithmetic masking of r modulo q.

At Step 5 of Decrypt, we must test membership r ∈ $L_r$ = T and m ∈ $L_m$ from masked r and m. We describe the corresponding high-order algorithms in Sections 6.1 and 6.2 below. The bit fail can be computed in the clear.

At Step 1 of Decapsulate, we obtain masked polynomials m and r, modulo q. For hashing (r, m) at Step 2 in Decapsulate, we must high-order mask the packS3 algorithm from [CDH+19], which is applied to (r, m) before hashing, with a Boolean masked output; see Section 6.3. We then high-order compute the hash function $H_1$ over Boolean shares, and the session key $k_1$ is eventually returned with Boolean shares. The same procedure is applied for $H_2$ if fail = 1.

Testing membership r ∈ L r = T
The membership test r ∈ $L_r$ = T is used at Step 5 of Decrypt. Recall that T is the set of non-zero ternary polynomials of degree at most − 2. We actually test whether r ∈ T ∪ {0}, which means that we consider (r, m) with r = 0 as a legitimate plaintext in the DPKE scheme. We consider the − 1 coefficients $r^{(j)}$ of r, where each coefficient is arithmetically masked modulo q with n shares. To test whether r ∈ T ∪ {0}, we must check that each of the − 1 coefficients $r^{(j)}$ is in {−1, 0, 1}. More precisely, we must high-order compute the bit:
$$b = \bigwedge_{j} \left( r^{(j)} \in \{-1, 0, 1\} \right),$$
which we can rewrite as:
$$b = \bigwedge_{j} \left( (r^{(j)} = -1) \vee (r^{(j)} = 0) \vee (r^{(j)} = 1) \right). \qquad (7)$$
In order to high-order compute (7), we first convert each coefficient $r^{(j)}$ from arithmetic to Boolean masking (see Appendix A.1). Secondly, we xor the first share with −1, 0 and 1 modulo q. Thirdly, we perform 3 zero-tests on Boolean shares to check whether the coefficient equals −1, 0 or 1 (see Appendix A.3). We then perform a secure Or between the 3 resulting tests, using $x \vee y = \overline{\bar{x} \wedge \bar{y}}$, with the same secure And gadget as in [ISW03]. Eventually, we obtain a Boolean sharing of the bit b. Since we must perform an arithmetic modulo q to Boolean conversion for each of the coefficients, the complexity is O( · log(q) · n 2 ).
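The per-coefficient logic described above can be sketched in Python. This is only an illustration of the data flow: `zero_test_reference` recombines the shares and is therefore not a masked zero-test (the secure version is the gadget of Appendix A.3), the arithmetic to Boolean conversion is skipped by sharing the value directly in Boolean form, and all function names are hypothetical.

```python
import secrets
from functools import reduce
from operator import xor


def bool_share(x: int, n: int, k: int) -> list[int]:
    """Split a k-bit value x into n Boolean (xor) shares."""
    shares = [secrets.randbits(k) for _ in range(n - 1)]
    shares.append(reduce(xor, shares, x))
    return shares


def xor_constant(shares: list[int], c: int) -> list[int]:
    """Xor a public constant into the first share: a sharing of x ^ c."""
    out = list(shares)
    out[0] ^= c
    return out


def zero_test_reference(shares: list[int]) -> int:
    """UNMASKED stand-in for the secure zero-test: 1 iff the value is 0."""
    return 1 if reduce(xor, shares) == 0 else 0


def or_bit(x: int, y: int) -> int:
    """Or computed from And and Not only (De Morgan)."""
    return 1 ^ ((1 ^ x) & (1 ^ y))


def coeff_is_ternary(shares: list[int], q: int) -> int:
    """1 iff the shared coefficient equals -1, 0 or 1 modulo q."""
    tests = [zero_test_reference(xor_constant(shares, c)) for c in (q - 1, 0, 1)]
    return or_bit(or_bit(tests[0], tests[1]), tests[2])


q, k, n = 2048, 11, 3
print([coeff_is_ternary(bool_share(x, n, k), q) for x in (0, 1, 2, q - 1)])
```

The bit b for the whole polynomial is then the And of these per-coefficient bits.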

Testing membership m ∈ L m
The membership test m ∈ $L_m$ is used at Step 5 of Decrypt. In the HRSS version, we have $L_m$ = T (see Table 1), and since the coefficients of m are ternary by definition (as they are obtained modulo 3), we do not need to perform any additional test. For the HPS version, we have $L_m$ = T (q/8 − 2), so we need to check that m has q/16 − 1 coefficients equal to 1 and q/16 − 1 coefficients equal to −1. To do so, we first check whether the sum of the coefficients of m is zero, and we then test whether the sum of the squared coefficients of m is q/8 − 2. More precisely, given the − 1 coefficients (m (0) , . . . , m ( −2) ) of m, we high-order compute the bit:
$$b = \Big( \sum_{j} m^{(j)} = 0 \Big) \wedge \Big( \sum_{j} \big(m^{(j)}\big)^2 = q/8 - 2 \Big).$$
For this, we need to perform two zero-tests on arithmetic sharing modulo q, starting from an arithmetic masking modulo q of the coefficients of m (which is also required for the high-order computation of r at Step 4 of Decrypt); see Appendix A.4. The complexity is O((log(q) + ) · n 2 ).
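As an unmasked reference for the HPS check, the two sums can be computed directly. The sketch below is not the masked algorithm; it only illustrates why the two equalities characterise the weight condition for ternary coefficients, assuming the number of coefficients is smaller than q so that the sums do not wrap around. The function name is hypothetical.

```python
def in_Lm_hps(m: list[int], q: int) -> bool:
    """Unmasked reference: m has coefficients in {-1, 0, 1}; returns True iff
    the sum is 0 and the sum of squares is q/8 - 2, both taken modulo q."""
    total = sum(m) % q
    total_sq = sum(c * c for c in m) % q
    return total == 0 and total_sq == (q // 8 - 2) % q


q = 2048
weight = q // 16 - 1                      # 127 coefficients of each sign
m = [1] * weight + [-1] * weight + [0] * 254
print(in_Lm_hps(m, q))                    # True
m[0] = 0                                  # break the balance
print(in_Lm_hps(m, q))                    # False
```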
Note that for the testing of m ∈ L m and r ∈ L r , the adversary should not learn whether m ∈ L m and r ∈ L r separately, so we must keep the result of both tests in masked form before returning the result of the And of the two tests. However this final result (fail = 0 or fail = 1) is not sensitive and can be computed in the clear. The total complexity is O( · log(q) · n 2 ).

Packing ternary polynomials
In the NIST submission of NTRU [CDH + 19], the authors describe the PackS3 algorithm for converting ternary polynomials into a sequence of bytes. In particular, the PackS3 algorithm is used for computing the hash k 1 = H 1 (r, m) at Step 2 of Decapsulate.
More precisely, given as input a vector v of 5 ternary coefficients $v = (v_0, \ldots, v_4) \in \{0, 1, 2\}^5$, the PackS3 algorithm interprets the vector v as an integer 0 ≤ x < 243 in base 3:
$$x = \sum_{j=0}^{4} 3^j \cdot v_j \qquad (8)$$
which is then converted into an 8-bit string. The above procedure is applied sequentially on chunks of five coefficients of the polynomial until no coefficient is left. When the polynomials r and m are arithmetically masked modulo 3, the above coefficients $v_j$ are also masked modulo 3. Therefore, we first perform an arithmetic modulo 3 to arithmetic modulo 256 conversion of each coefficient $v_j$ (we refer to Appendix A.2 for a full description of the conversion algorithm):
$$v_j = v_{j,1} + \cdots + v_{j,n} \pmod 3 = w_{j,1} + \cdots + w_{j,n} \pmod{256} \qquad (9)$$
Combining (8) and (9), we obtain an arithmetic masking of x modulo 256:
$$x = \sum_{i=1}^{n} x_i \pmod{256}, \qquad x_i = \sum_{j=0}^{4} 3^j \cdot w_{j,i} \bmod 256.$$
Eventually we perform an arithmetic to Boolean conversion of x. The final complexity is $O(n^2)$ for n shares.
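The share-wise packing step is linear, which the following Python sketch illustrates. It assumes that the coefficients have already been converted to arithmetic shares modulo 256 (the conversion of Appendix A.2 is not reproduced; the shares are simply generated modulo 256), and the helper names are hypothetical.

```python
import secrets


def share_mod(x: int, n: int, modulus: int) -> list[int]:
    """Split x into n additive shares modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % modulus)
    return shares


def pack5_reference(v: list[int]) -> int:
    """Unmasked packing of five coefficients in {0, 1, 2} into 0 <= x < 243."""
    return sum(3 ** j * v[j] for j in range(5))


def pack5_shares(w: list[list[int]]) -> list[int]:
    """w[j][i] is share i of coefficient j modulo 256.
    Returns n arithmetic shares of x modulo 256, computed share by share."""
    n = len(w[0])
    return [sum(3 ** j * w[j][i] for j in range(5)) % 256 for i in range(n)]


v = [2, 0, 1, 1, 2]
w = [share_mod(c, 4, 256) for c in v]
x_shares = pack5_shares(w)
print(sum(x_shares) % 256 == pack5_reference(v))  # True
```

No interaction between shares is needed here; the only non-linear work is in the conversions before and after this step.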
In the algorithm above we have assumed that the polynomial r is initially masked modulo 3, while after Step 4 of the Decrypt algorithm (Alg. 3), the polynomial r is actually masked modulo q. However, we know after Line 5 that the polynomial r must be ternary. Therefore, we can use the Mod3Red algorithm from Section 4.1 to obtain an arithmetic masking modulo 3 of r. We also describe in Appendix C another method to pack ternary polynomials when they are arithmetically masked modulo q.

High-order masking of NTRU key generation
In this section, we consider the high-order masking of the NTRU key generation. We first recall the KeyGen algorithm, already described in Section 3.

Algorithm 1 KeyGen

We summarize below the high-order masking of the KeyGen algorithm:

1. At Step 1 of KeyGen, we must obtain the masked secret f ← $L_f$. In the HPS version, $L_f$ = T, which is the set of non-zero ternary polynomials. We describe the corresponding algorithm in Section 7.1.
In the HRSS version, we have L f = T + . We describe the corresponding algorithm in Section 7.2. In both cases, we output both an arithmetic masking modulo 3 and an arithmetic masking modulo q of the polynomial f .
2. Similarly, we must generate g ← L g . The polynomial g must be masked modulo q. In the HPS version, we must sample g ∈ T (q/8 − 2). The procedure was already described in Section 4.3. In the HRSS version, we must sample g ← Φ 1 · T + , see Section 7.2.

3. At Step 2, we must mask the inversion f q ← (1/f ) mod (q, Φ ), starting from an arithmetic masking modulo q of f. The inversion can be computed as a sequence of squares and multiplies in the finite field modulo (2, Φ ), and then lifted by a sequence of multiplications to modulo (q, Φ ). This was already considered in Section 4.3.

4. At Step 3, we compute a high-order multiplication of g and $f_q$ to obtain the public key h, whose shares are recombined. The inversion at Step 4 is then done in the clear. Namely, $h_q$ is part of the secret key only to speed up the recomputation of r during the CCA decryption, but $h_q$ does not need to be secret since it can be computed from the public key h.

5. Finally, at Step 5, we must also high-order compute the inversion f p ← (1/f ) mod (3, Φ ). This is also performed as a sequence of squares and multiplies in the finite field modulo (3, Φ ), as when working modulo 2. We describe this procedure in Appendix D.

Masked generation of f ← L f with L f = T (HPS version)
We describe the high-order masked generation of f ← $L_f$ at Step 1 of KeyGen. We first consider the HPS version where $L_f$ = T; we will consider the HRSS version in the next section. Recall that T is the set of non-zero ternary polynomials of degree at most − 2. Therefore |T | = 3 −1 − 1. For simplicity we can actually generate a random f ∈ T ∪ {0}, so that we can generate each coefficient of f in {−1, 0, 1} independently. The high-order sampling is straightforward: we simply generate independently n polynomials $f_i$ for 1 ≤ i ≤ n with random coefficients modulo 3. The polynomials $f_i$ will be the n arithmetic shares modulo 3 of the secret polynomial f:
$$f = f_1 + \cdots + f_n \pmod 3.$$
Recall that we must also obtain an arithmetic sharing modulo q of f. For this we convert each coefficient $f^{(j)}$ of f from masking modulo 3 to masking modulo q. This is easily done by applying the table-based conversion algorithm from [CGMZ22]; see Appendix A.2.
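The sampling of the shares can be sketched in a few lines of Python. The sketch assumes a hypothetical coefficient count passed as a parameter and uses the `secrets` module as a stand-in for the RNG; the recombination function exists only to check the result and would never be run on a device.

```python
import secrets


def sample_ternary_shares(num_coeffs: int, n: int) -> list[list[int]]:
    """Return n share polynomials, each with `num_coeffs` uniform coefficients
    modulo 3. Together they form an arithmetic sharing modulo 3 of f."""
    return [[secrets.randbelow(3) for _ in range(num_coeffs)] for _ in range(n)]


def recombine_centered(shares: list[list[int]]) -> list[int]:
    """FOR CHECKING ONLY: unmask f and map coefficients to {-1, 0, 1}."""
    out = []
    for column in zip(*shares):
        c = sum(column) % 3
        out.append(-1 if c == 2 else c)
    return out


shares = sample_ternary_shares(num_coeffs=8, n=4)
f = recombine_centered(shares)
print(all(c in (-1, 0, 1) for c in f))  # True
```

Since each share coefficient is uniform modulo 3, each coefficient of f is uniform in {−1, 0, 1}, which matches sampling f from T ∪ {0}.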

Masked generation of
In the HRSS version of the scheme, one must sample the polynomial f in the set T + , which is a subset of T containing solely polynomials Elements of T + are said to be non-negatively correlated; we refer to [CDH + 19, Section 2.2.4] for the motivation of generating f in T + rather than T .
We first describe the unmasked version. We first generate a random element v ← T, with $v = \sum_{i} v^{(i)} X^i$. We then compute the correlation:
$$t = \sum_{i} v^{(i)} \cdot v^{(i+1)}. \qquad (10)$$
If t < 0, we flip the sign of the even-indexed coefficients, so that we obtain a positive correlation. Indeed, letting v' be the polynomial with flipped coefficients and letting t' be its correlation, we obtain:
$$t' = \sum_{i} v'^{(i)} \cdot v'^{(i+1)} = -\sum_{i} v^{(i)} \cdot v^{(i+1)} = -t,$$
since each product of two consecutive coefficients involves exactly one even-indexed coefficient.

For the high-order masked version, we start from a high-order masked v ← T from the procedure of Section 7.1, with an arithmetic masking modulo q. We can high-order compute the value t in (10) using a sequence of secure multiplications and additions modulo q. The sign of t can then be retrieved by converting to Boolean masked form and extracting the most significant bit. This sign bit is not sensitive, since eventually we must have t ≥ 0. Therefore it can be unmasked, and if t < 0 we can flip the even-indexed coefficients over the arithmetic shares modulo q. Note that the value of t can be computed modulo q, because we must have |t| < < q/2. The complexity is O((log(q) + ) · n 2 ).
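The unmasked sign-flip argument is easy to check numerically. The Python sketch below uses hypothetical helper names and plain integer coefficients in {−1, 0, 1}; it does not model the masked computation.

```python
import random


def correlation(v: list[int]) -> int:
    """Sum of products of consecutive coefficients."""
    return sum(v[i] * v[i + 1] for i in range(len(v) - 1))


def flip_even(v: list[int]) -> list[int]:
    """Negate the even-indexed coefficients."""
    return [-c if i % 2 == 0 else c for i, c in enumerate(v)]


def to_nonneg_correlation(v: list[int]) -> list[int]:
    """Return v if its correlation is non-negative, else the flipped version."""
    return v if correlation(v) >= 0 else flip_even(v)


rng = random.Random(1)
v = [rng.choice((-1, 0, 1)) for _ in range(20)]
print(correlation(flip_even(v)) == -correlation(v))   # True
print(correlation(to_nonneg_correlation(v)) >= 0)     # True
```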
Masked generation of g ← L g = Φ 1 ·T + (HRSS version). We proceed similarly for the generation of g ← L g = Φ 1 · T + , simply by generating a random element in T + as above, and then multiplying by Φ 1 .

Implementation results
In order to assess the practicality and scalability at high order of our countermeasure, we have written a proof-of-concept implementation in C. The source code can be found at https://github.com/fragerar/Masked_NTRU. We have run our implementation on a laptop equipped with an Intel CPU, and also on a Cortex-M3 core mounted on an Arduino Due board. Random numbers are generated using a simple xorshift PRNG; a secure implementation should replace it by a cryptographically secure PRNG or a TRNG.
Performances on Intel CPU. We provide the running times for various security orders t in tables 5, 6, 7 and 8. More precisely, in Table 5, we display the cycle counts for the masked version of the decapsulation procedure incorporated in the reference code, across all parameter sets. The scaling seems to be quite reasonable for all versions of NTRU. However, this result is slightly biased by the fact that the polynomial multiplication used in the reference code of NTRU is not optimized. Indeed, this operation is relatively slow, and therefore the overhead incurred by our new gadgets is relatively low, since a large amount of time is spent in the polynomial multiplications.

Similarly, we provide in tables 6 and 7 the cycle count for the key generation, using the exponentiation method from Section 4.3 and the multiplicative method from Section 5.2. We see that, as in Section 5.3, the latter is more efficient.

Table 6: Cycle counts for key generation (exponentiation method) for all parameters of NTRU, in thousands of cycles, on Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz.

Table 7: Cycle counts for key generation (multiplicative method) for all parameters of NTRU, in thousands of cycles, on Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz.
We also provide in Table 8 the cycle counts using the AVX2 optimized version of the reference code for the ntruhps2048509 parameter set, which significantly reduces the cost of polynomial multiplication. We obtain a significant speed-up for the Decapsulate, KeyGen (exponentiation method) and KeyGen' (multiplicative method) algorithms. In particular, since KeyGen and KeyGen' consist almost entirely of polynomial multiplications (and randomness generation), their runtime is greatly reduced by the AVX2 optimizations, which makes it competitive with the decapsulation. On the other hand, the overhead of masking the decapsulation is now much larger, since the gadgets that do not depend on the polynomial arithmetic account for a larger share of the runtime. We also display in Table 8 the relative performances of the gadgets. We see that the reduction modulo 3 and the ternary check are the most time consuming, because of the conversions between arithmetic and Boolean masking.

We provide in Table 9 the randomness usage of the full decryption (Decapsulate) and of the key generation (KeyGen and KeyGen'); we also provide the randomness consumption of the main gadgets. As expected, the number of calls to the RNG grows significantly when the order increases. In general, randomness usage is strongly correlated to performance, because share refreshing is needed at the core of most gadgets to ensure security in the probing model. The exceptions are gadgets that manipulate polynomials with small coefficients, such as the masked multiplication of ternary polynomials and the key generation procedure. Indeed, they are cheap in terms of randomness, since multiple coefficients can be extracted from a single 32-bit integer, but they still perform the expensive polynomial multiplication in the ring. Note that for the gadgets performing refreshes modulo q, a whole call to the RNG is counted for each value in $\mathbb{Z}_q$. In practice, at least two values could be extracted from the 32-bit output of the RNG, but this was not done for the sake of simplicity and to avoid potential leakage due to multiple random elements of $\mathbb{Z}_q$ depending on the same initial random value.

Embedded implementation. In addition, since masking schemes are mainly aimed at embedded devices, we have also tested our code on a Cortex-M3 core mounted on an Arduino Due board. The cycle counts on this platform for the decapsulation and the key generation of ntruhps2048509 are displayed in Table 10. We see that the scaling of the masking scheme at different orders is mostly similar to the results of tables 5 and 6. This is not surprising since the implementation is in plain C and not optimized for any particular architecture.

Finally, we also assess the leakage in practice by performing a fixed vs random t-test over 10 000 traces for one of the main gadgets, namely the reduction modulo 3 described in Section 4.1. The results can be found in Figure 1. The platform used for the experiments is a ChipWhisperer-Lite board that embeds a Cortex-M4 microcontroller (STM32F303) and a light oscilloscope.
For the leakage assessment, we have rewritten the gadget specifically at order 1 in ARM assembly, to avoid potential side-channel unsafe modifications from the compiler. We have conducted a fixed versus random t-test using the methodology described in [SM15]. The technique consists in performing the power consumption measurements while the device is executing the targeted gadget either with a fixed secret value chosen beforehand, or with a random value sampled before each measurement. This creates two sets of traces corresponding to the fixed vs the random values respectively. The t-test will then be used as a distinguisher between the two sets at each point in the power traces. If the values output by the t-test are high, it means that the statistical difference could potentially be used by the adversary to learn something about the secret key. In practice, we have used a set of 10 000 traces. For each trace, a coin was flipped to determine whether the random or the fixed secret value should be used.
We see in Figure 1 that when the RNG is switched off with randomness set to 0 (that is, without refreshing the shares), the random and fixed inputs are distinguishable as the t-values are well above the usual threshold |t| > 4.5. When the random number generator is switched on, values are properly masked and the test is successful on the gadget.
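The statistic behind such a fixed vs random test is Welch's t-statistic, computed independently at each sample point of the traces. The Python sketch below illustrates it on synthetic Gaussian traces; the trace generator, the leak position and the sizes are made up for the example and do not reproduce the measurements of Figure 1.

```python
import math
import random
from statistics import fmean, variance

THRESHOLD = 4.5  # usual pass/fail bound on |t|


def welch_t_per_sample(fixed_traces, random_traces):
    """Welch t-statistic at each sample point between the two trace sets."""
    t_values = []
    for fixed_col, random_col in zip(zip(*fixed_traces), zip(*random_traces)):
        diff = fmean(fixed_col) - fmean(random_col)
        spread = math.sqrt(variance(fixed_col) / len(fixed_col)
                           + variance(random_col) / len(random_col))
        t_values.append(diff / spread if spread > 0 else 0.0)
    return t_values


def synthetic_traces(count, length, leak_at, leak, rng):
    """Gaussian noise traces; optionally add a constant leak at one point."""
    traces = []
    for _ in range(count):
        trace = [rng.gauss(0.0, 1.0) for _ in range(length)]
        if leak_at is not None:
            trace[leak_at] += leak
        traces.append(trace)
    return traces


rng = random.Random(0)
fixed = synthetic_traces(2000, 8, leak_at=3, leak=1.0, rng=rng)
rand = synthetic_traces(2000, 8, leak_at=None, leak=0.0, rng=rng)
t = welch_t_per_sample(fixed, rand)
print([abs(v) > THRESHOLD for v in t])
```

In the methodology described above, the two sets come from a single acquisition campaign in which a coin flip selects the class of each trace; the sketch simply starts from the two sets once they have been separated.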

Conclusion
In this paper, we have described the first fully masked implementation of the NTRU Key Encapsulation Mechanism submitted to NIST (IND-CCA decapsulation and key generation), with a security proof in the ISW probing model. We have provided a concrete implementation on ARM Cortex-M3 architecture, showing that our implementation is reasonably efficient, and also a t-test leakage evaluation. Finally, we have described a third-order attack against a high-order polynomial inversion algorithm for NTRU recently published in [KLRBG22], and a repaired algorithm with a security proof in the ISW probing model.

A Existing masking gadgets
In this section, we summarize the main masking gadgets used in the definition of our algorithms, with their running-time complexity and security property.

A.1 Conversion between arithmetic and Boolean masking
For the high-order masking of NTRU, we need to convert between arithmetic masking modulo $2^k$ and Boolean masking. Such high-order conversion was first described in [CGV14], with complexity $O(n^2 \cdot k)$ for n shares and k-bit words, with the NI property, in both directions. To obtain the SNI property, it suffices to compose with an SNI mask refreshing. These conversion algorithms were later extended by [BBE+18] to arithmetic masking modulo any integer q, with complexity $O(n^2 \cdot k)$ or even $O(n^2 \cdot \log k)$, where $k = \log_2(q)$, still with the SNI property.
Recently, a different algorithm was described in [CGMZ22], based on randomized table-recomputation, with the same complexity O(n 2 · k) in both directions, and satisfying the SNI property. An alternative algorithm for converting from Boolean to arithmetic masking is also described in [SPOG19], with the same property.
In summary, we can assume that we have SNI conversion algorithms denoted AtoB q and BtoA q , to convert between arithmetic masking modulo q and Boolean masking, with asymptotic complexity O(n 2 · log q) in both directions, and satisfying the SNI property.

A.2 Arithmetic modulo 3 to modulo q conversion
We describe the conversion from arithmetic masking modulo 3 to masking modulo $2^k$. One could use the composition of two conversions with Boolean masking as an intermediate step, with complexity $O(n^2 \cdot k)$. Alternatively, a direct approach based on table recomputation is easier and more efficient, with complexity $O(n^2)$ only.
More precisely, in [CGMZ22], the authors described the high-order computation of any function f : G → H where G and H are arbitrary groups. We instantiate their generic conversion with $G = \mathbb{Z}_3$, $H = \mathbb{Z}_{2^k}$ and the injection $f : \mathbb{Z}_3 \to \mathbb{Z}_{2^k}$ that maps 0, 1, −1 to 0, 1, $2^k - 1$ respectively. This leads to Algorithm 18, with complexity $O(n^2)$. It uses a table T with 3 rows T(0), T(1) and T(2) of n shares each. As shown in [CGMZ22], the algorithm satisfies the SNI property.

A.3 Zero-testing over Boolean shares
We consider the zero-testing of a value $x \in \{0, 1\}^k$ over Boolean shares. More precisely, the algorithm takes as input a Boolean sharing of x, and returns a Boolean sharing of b ∈ {0, 1} such that b = 1 if and only if x = 0. Writing $x = (x^{(0)}, \ldots, x^{(k-1)})_2$ for the k bits of x, we have $b = \bigwedge_{i=0}^{k-1} \overline{x^{(i)}}$. Therefore the bit b can be high-order computed by using high-order secure And gadgets, with the SNI property. We refer to [CGMZ21] for the description of such an algorithm, with complexity $T_{\mathrm{ZeroTestBool}}(k, n) = O(k \cdot n^2)$.

A.5 Linear mask refreshing
We recall the LinearRefresh algorithm from [RP10], working in any additive group G:

Algorithm 19 LinearRefresh
Input: $x_1, \ldots, x_n \in G$
Output: $y_1, \ldots, y_n \in G$ such that $y_1 + \cdots + y_n = x_1 + \cdots + x_n$
1: $y_n \leftarrow x_n$
2: for j = 1 to n − 1 do
3:   $r_j \leftarrow^{\$} G$
4:   $y_j \leftarrow x_j + r_j$
5:   $y_n \leftarrow y_n - r_j$
6: end for
7: return $y_1, \ldots, y_n$

B Computing inverses in S/q

B.1 Proof of Theorem 3 (correctness of exponentiation in Z 2 [X]/Φ )

We claim Algorithm 9 is correct. Let x ∈ Z 2 [X]/Φ and m ∈ N. We show by induction on k − 1 ≥ i ≥ 0 that at the end of each iteration of the loop, the value $y_i$ of the variable y satisfies $y_i = x^{2^{M_i} - 1}$, where $M_i = \lfloor m / 2^i \rfloor$. For i = k − 1, we have $M_{k-1} = m_{k-1} = 1$, hence $y_{k-1} = x = x^{2^{M_{k-1}} - 1}$ as required. We now assume the result holds at iteration i and we show that the result holds at step i − 1. From the square step, we have $y'_i = (y_i)^{2^{M_i}} \times y_i = x^{2^{2 M_i} - 1}$, and after the multiply step, we have
$$y_{i-1} = (y'_i)^{2^{m_{i-1}}} \times x^{m_{i-1}} = x^{e}, \qquad e = 2^{2 M_i + m_{i-1}} - 2^{m_{i-1}} + m_{i-1}.$$
From $2 \cdot M_i + m_{i-1} = M_{i-1}$ and $m_{i-1} - 2^{m_{i-1}} = -1$ we deduce $e = 2^{M_{i-1}} - 1$. Hence the induction step is proven. Therefore $y_0 = x^{2^{M_0} - 1} = x^{2^m - 1}$ and the algorithm is correct.
Moreover, we need one multiplication for each square step and one for each multiply step, with the exception of the first square step, which corresponds to 1 × 1. This leads to a number of multiplications:

B.2 Proof of Theorem 4
We claim that Algorithm 10 is correct. Indeed, we show by induction that at the beginning of each step i of the while loop we have t i = 2 i and v i · a = 1 (mod (2 t i , Φ )), where v i denotes the variable v at Step i. At step i = 0, by definition we have t 0 = 1. Moreover we have v 0 · a = 1 mod (2, Φ ).
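The doubling invariant $t_i = 2^i$ can be illustrated on plain integers instead of polynomials. The Python sketch below assumes the classical Newton update v ← v · (2 − a · v), which doubles the 2-adic precision at each step; Algorithm 10 is not reproduced in this section, so the exact update it uses is an assumption here, and the function name is illustrative.

```python
def inverse_mod_power_of_two(a: int, k: int) -> int:
    """Inverse of an odd integer a modulo 2**k by Newton lifting.

    Invariant at the start of each loop iteration: v * a = 1 mod 2**t.
    """
    if a % 2 == 0:
        raise ValueError("a must be odd to be invertible modulo 2**k")
    v, t = 1, 1                            # a * 1 = 1 mod 2 since a is odd
    while t < k:
        t = min(2 * t, k)                  # precision doubles, capped at k
        v = v * (2 - a * v) % (1 << t)     # Newton step, reduced mod 2**t
    return v


q_bits = 11                                # q = 2048
v = inverse_mod_power_of_two(1235, q_bits)
print(1235 * v % (1 << q_bits))            # 1
```

The correctness follows from a · v · (2 − a · v) = 1 − (1 − a · v)^2, so an error divisible by 2^t becomes divisible by 2^{2t}.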

B.3 Secure exponentiation modulo 2
We provide in Algorithm 20 the high-order masking of the FastExpo algorithm recalled in Section 4.3. We assume that we have a SecMult algorithm for high-order computing the product of two polynomials in Z 2 [X]/Φ , with the SNI property. It can be obtained as a straightforward extension of the And gadget from [ISW03].

B.4 Masking inversion in S/q
We provide an algorithmic description of the high-order masked version of the SqInverse algorithm from Section 4.3. As previously, we assume that we have a SecMulPoly algorithm for high-order computing the product of two polynomials in Z q [X]/Φ , with the SNI property, as it can be obtained as a straightforward extension of the And gadget from [ISW03].
C Packing S/3 polynomials from S/q

During decryption, it is required to pack polynomials with coefficients in {0, 1, q − 1}. In the unmasked version, this is performed by first applying the map {0, 1, q − 1} → {0, 1, 2} to the five coefficients to obtain $(v_0, \ldots, v_4) \in \{0, 1, 2\}^5$ and then packing as depicted in Section 6.3. While applying the map directly is cheap in unmasked form, it is more expensive over shares. Instead, we use the following trick: consider the function $f : \mathbb{Z}_{512} \to \mathbb{Z}_{512}$, $x \mapsto x \cdot (511 + 3x)$, which maps the set {0, 1, 511} to {0, 2, 4} in $\mathbb{Z}_{512}$. We note that a masked version of f is fairly cheap to compute over arithmetic shares modulo 512, since the only non-linear operation is a SecMult. We first map the coefficients from {0, 1, q − 1} to {0, 1, 511} by reducing every share modulo 512 (recall that q is a power of two) and then apply the masked f to bring the coefficients into {0, 2, 4} in arithmetic form modulo 512. Once we have our five coefficients $(v'_0, \ldots, v'_4) \in \{0, 2, 4\}^5$, we compute $x' = \sum_{j=0}^{4} 3^j \cdot v'_j = 2 \cdot \sum_{j=0}^{4} 3^j \cdot v_j$ as in the regular packS3. Eventually, we obtain the correct result by performing an arithmetic to Boolean conversion of x' and right-shifting every share by 1, effectively dividing x' by 2. We note that it is trivial to find an equivalent of f over $\mathbb{Z}_q$, and thus that we could have directly mapped {0, 1, q − 1} to {0, 2, 4}, but we decided to first reduce modulo 512 (which is the smallest power of two giving a result holding over Z) to make the arithmetic to Boolean conversion cheaper.

D High-order computing inverses over S/3 = Z[x]/(3, Φ )

3
[X] \ {0}| = 3 −1 − 1. Therefore, as in the modulo 2 case, we can compute the inverse of f via an exponentiation: To compute this exponentiation efficiently, we can adapt equation (6) from the modulo 2 case, using the identity 3 a+b − 1 = 3 a · (3 b − 1) + (3 a − 1): Adapting Algorithm 9 from Section 4.3, we obtain the following algorithm. The correctness is proved similarly. if m i = 1 then y ← y 3 × x 7: end for 8: return y D.2 High-order inversion in S/3 We describe the high-order masking of the previous FastExpo3 algorithm.
-If no variable has been probed in H i (t 2 = 0), the t 1 = t probed variables from G i−1 can be simulated from at most t inputs since G i−1 is assumed to achieve t − NI.
-If at least one variable has been probed in $H_i$ ($t_2 > 0$), we consider the $t_3$ and $t_4$ variables probed in LinearRefresh and the rest of $H_i$ respectively. We can construct a subset O ⊂ [1, i + 1] such that the $t_4$ probes in the rest of $H_i$ can be simulated from the outputs $a_{|O}$ of LinearRefresh and $m_{i+1}$, with $|O| \le t_4$. We apply Lemma 1 to LinearRefresh with the $t_3$ internal probes and the output probes corresponding to O, with $t_3 + |O| \le t_3 + t_4 \le t_2$. Since by convention no input of LinearRefresh has been probed, there exists a subset $I_H$ such that the above probes can be perfectly simulated from the inputs $a_{|I_H}$ of $H_i$, with $|I_H| \le t_2 - 1$. Therefore the $t_2$ probes in $H_i$ can be simulated from $a_{|I_H}$ and $m_{i+1}$. Finally, applying the induction hypothesis on $G_{i-1}$, we obtain a subset I ⊂ [1, i] such that the $t_1$ internal probes and the outputs $a_{|I_H}$ can be perfectly simulated from $m_{|I}$ with $|I| \le t_1 + |I_H| \le t_1 + t_2 - 1 = t - 1$. Hence the t probes in the full $G_i$ gadget can be perfectly simulated from $m_{|I'}$ with $I' = I \cup \{i + 1\}$ and $|I'| \le t$ as required.
In both cases, the t probes in the $G_i$ gadget can be perfectly simulated using at most t inputs, which concludes the proof.