Probing Security through Input-Output Separation and Revisited Quasilinear Masking

Abstract. The probing security model is widely used to formally prove the security of masking schemes. Whenever a masked implementation can be proven secure in this model with a reasonable leakage rate, it is also provably secure in a realistic leakage model known as the noisy leakage model. This paper introduces a new framework for the composition of probing-secure circuits. We introduce the security notion of input-output separation (IOS) for a refresh gadget. From this notion, one can easily compose gadgets satisfying the classical probing security notion (which does not ensure composability on its own) to obtain a region-probing-secure circuit. Such a circuit is secure against an adversary placing up to t probes in each gadget composing the circuit, which ensures a tight reduction to the more realistic noisy leakage model. After introducing the notion and proving our composition theorem, we compare our approach to the composition approaches obtained with the (Strong) Non-Interference (S/NI) notions as well as the Probe-Isolating Non-Interference (PINI) notion. We further show that any uniform SNI gadget achieves the IOS security notion, while the converse is not true. We then describe a refresh gadget achieving the IOS property for any linear sharing with a quasilinear complexity Θ(n log n) and an O(1/log n) leakage rate (for an n-size sharing). This refresh gadget is a simplified version of the quasilinear SNI refresh gadget proposed by Battistello, Coron, Prouff, and Zeitoun (ePrint 2016). As an application of our composition framework, we revisit the quasilinear-complexity masking scheme of Goudarzi, Joux and Rivain (Asiacrypt 2018). We improve this scheme by generalizing it to any base field (whereas the original proposal only applies to fields with nth roots of unity) and by taking advantage of our composition approach. We also patch a flaw in the original security proof and extend it from the random probing model to the stronger region probing model.
Finally, we present some applications of this extended quasilinear masking scheme to AES and MiMC and compare the resulting performance.


Introduction
In cryptography, side-channel attacks are attacks that extract information from the physical implementation of a cryptosystem. Rather than exploiting a weakness in the underlying cryptographic algorithm, the attacker exploits leaked physical information to extract the secret key from a specific implementation.
Probing security is a notion put forward by Ishai, Sahai and Wagner in [31] to evaluate the security of a circuit against a class of physical attacks. Specifically, they consider t-probing attacks, in which the adversary can place probes on t wires of a circuit processing some secrets. The circuit is said to be t-probing secure if the values of the t probed wires reveal no information about the secrets. More formally, one should be able to perfectly simulate the distribution of the probed wires without any knowledge of the secrets. In their paper, Ishai et al. propose a scheme, the so-called ISW scheme, to compile a circuit into a new randomized circuit (i.e. a circuit featuring random generation gates) which is resistant to t-probing attacks. Their scheme uses an additive secret sharing (a.k.a. Boolean masking) of the processed variables. Specifically, each variable x is split into n ≥ 2 variables $x_1, x_2, \dots, x_n$, called the shares, which are uniformly distributed among the n-tuples satisfying $x = x_1 + x_2 + \cdots + x_n$ (where + is the addition over $\mathbb{F}_2$ in the original scheme).
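A minimal sketch of additive (Boolean) sharing, assuming byte-sized variables: a byte is a vector of eight bits over F_2, so the F_2 addition of the ISW scheme becomes bitwise XOR applied to all eight bits in parallel. The function names are illustrative and not taken from any library.

```python
import secrets
from functools import reduce
from operator import xor


def boolean_share(x: int, n: int) -> list[int]:
    """Split a byte x into n shares with x = x_1 ^ x_2 ^ ... ^ x_n.

    The first n - 1 shares are uniform; the last one is fixed by the
    relation, so any n - 1 shares are jointly uniform.
    """
    if n < 2:
        raise ValueError("need at least two shares")
    shares = [secrets.randbelow(256) for _ in range(n - 1)]  # x_1, ..., x_{n-1}
    shares.append(reduce(xor, shares, x))                   # x_n = x ^ x_1 ^ ... ^ x_{n-1}
    return shares


def boolean_unshare(shares: list[int]) -> int:
    """Recombine the shares: x = x_1 ^ ... ^ x_n."""
    return reduce(xor, shares, 0)


shares = boolean_share(0x3A, 4)
print(hex(boolean_unshare(shares)))  # 0x3a
```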
Using such an additive sharing to protect a cryptographic computation was already proposed in 1999 as a countermeasure against side-channel attacks [20,28]. Many masking schemes describing efficient implementations of ciphers protected at some given (low) orders were published in the early 2000s, see e.g. [35,3,36]. In this context, probing security is analogous to security against so-called higher-order side-channel attacks. In such an attack, an adversary combines t leakage points from a power consumption (or electromagnetic) trace to extract information on the secret. If properly implemented, a t-probing secure scheme achieves provable security against this kind of attack. The ISW scheme hence had a strong impact on the side-channel research community, and it was used as a building block in many popular masking schemes, see e.g. [41,33,18,24,22,40,44,30,11,12,32].
Although an ISW-based masking scheme can achieve some level of resistance against side-channel attacks, the probing security notion is not fully satisfactory in this context. In practice, a side-channel adversary gets some leakage on the full computation and has no reason to limit herself to t leakage points. Nevertheless, side-channel leakage is often (or can be made) noisy, and the effect of the noise is known to be amplified by masking: the number of leakage samples an attacker needs grows exponentially with the masking order [20]. This was the motivation behind the formal noisy leakage model introduced by Prouff and Rivain in [39]. In this model, every variable (or wire) x in the computation leaks a noisy function f(x). The noisy property is captured by assuming that the bias introduced in the distribution of x by an observation of f(x) is smaller than some bound δ.
Subsequently, Duc, Dziembowski and Faust showed, through a security reduction, that security in the noisy leakage model can be obtained for a probing-secure scheme [25]. In a nutshell, the so-called DDF reduction goes through an intermediate leakage model, the random probing model, already considered by Ishai et al. in [31] and formalized by Ajtai in [2], in which each variable (or wire) is independently leaked to the adversary with a given probability p. By applying the Chernoff bound, one gets that a t-probing secure circuit $\widehat{C}$ is also p-random probing secure with $p = O(t/|\widehat{C}|)$ (where $|\widehat{C}|$ denotes the number of wires of $\widehat{C}$). Duc et al. could then show a transition from p-random probing security to δ-noisy leakage security with δ = O(p/|K|), where |K| is the size of the base field K of the computation. It was recently shown that the impact of the field size can be relaxed by refining the granularity of the computation [29] or by considering alternative definitions of the noisy leakage model [38].
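To make the Chernoff step concrete, the following sketch computes the exact probability that a p-random probing adversary obtains more than t wire values out of $|\widehat{C}|$ wires, i.e. the event on which a t-probing simulator cannot be invoked. The choice $p = t/(2|\widehat{C}|)$ in the usage below is an illustrative assumption, not the constant used in [25]; for this choice, the multiplicative Chernoff bound gives a failure probability of at most exp(−t/6). Log-space arithmetic avoids overflowing floats with large binomial coefficients.

```python
import math


def log_binom_pmf(num: int, k: int, p: float) -> float:
    """log Pr[Binomial(num, p) = k], computed with lgamma to avoid overflow."""
    return (math.lgamma(num + 1) - math.lgamma(k + 1) - math.lgamma(num - k + 1)
            + k * math.log(p) + (num - k) * math.log1p(-p))


def excess_probe_probability(num_wires: int, p: float, t: int) -> float:
    """Pr[more than t of num_wires wires leak], each leaking independently w.p. p (0 < p < 1)."""
    return sum(math.exp(log_binom_pmf(num_wires, k, p))
               for k in range(t + 1, num_wires + 1))


num_wires, t = 10_000, 60
p = t / (2 * num_wires)  # illustrative choice: expected number of leaks is t/2
print(excess_probe_probability(num_wires, p, t), "<=", math.exp(-t / 6))
```

Growing t (and p proportionally) drives the failure probability down exponentially, which is why the reduction only pays a constant factor between p and the probing rate.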
The DDF reduction, and hence the security obtained in the noisy leakage model, is thus mainly driven by the leakage rate (or probing rate), namely the ratio between the number of tolerated probes and the size of the circuit [6]. In order to tolerate a significant leakage parameter $\delta = O(t/(|K| \cdot |\widehat{C}|))$, the leakage rate should be as close as possible to 1. In particular, one should be able to tolerate a number of probes that grows linearly with the size of the circuit. To this aim, the circuit should achieve the stronger notion of region probing security formalized by Andrychowicz, Dziembowski, and Faust in [6]: it should be separable into regions that each tolerate some number of probes independently of the total size of the circuit. This notion was already considered in the work of Ishai et al., and their scheme was shown to be region probing secure. Specifically, it can tolerate up to t < n/2 probes per protected gate, or gadget, for a masking order n. Since the ISW gadgets require $O(n^2)$ operations, the obtained leakage rate is O(1/n). Such a leakage rate is not fully satisfactory since it implies that the leakage parameter δ must decrease linearly with the number of shares (i.e. the noise must increase linearly). In particular, no security can be obtained for the ISW gadgets in the context of a constant leakage rate (i.e. on a given target device), and some practical attacks were exhibited to underline this issue [9].
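As a rough worked version of this argument (orders of magnitude only, and assuming, as the reduction sketched above suggests, that the tolerated leakage probability p is of the order of the leakage rate r), the ISW gadgets give

$$ r_{\mathrm{ISW}} = \frac{t}{|G|} < \frac{n/2}{\Theta(n^2)} = O\!\left(\frac{1}{n}\right), \qquad \delta = O\!\left(\frac{p}{|K|}\right) = O\!\left(\frac{r_{\mathrm{ISW}}}{|K|}\right) = O\!\left(\frac{1}{n\,|K|}\right), $$

where |G| denotes the size of an ISW gadget. The tolerated bias δ thus shrinks as the number of shares grows, whereas a constant leakage rate would keep it independent of n.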
Fortunately, some schemes are known to achieve constant (or quasi-constant) leakage rates. Such a scheme was first proposed by Ajtai in [2], which achieves random probing security with leakage rate O(1). Another scheme, partly based on Ajtai's work, was proposed by Andrychowicz, Dziembowski, and Faust in [6], which achieves probing security with leakage rate O(1/log n) and random probing security with leakage rate O(1). More recently, Ananth, Ishai and Sahai [5] proposed a conceptually simpler approach to achieve random probing security with leakage rate O(1). This approach was further improved by Belaïd, Coron, Prouff, Rivain and Taleb in [13]. In terms of complexity, all these proposals yield a protected circuit of size $O(|C| n^2)$ or larger, where |C| is the size of the original circuit. This was recently improved by Goudarzi, Joux and Rivain [29], who proposed a scheme using a Fast Fourier Transform-based (FFT) polynomial multiplication to obtain the first construction achieving an O(|C| n log n) complexity with an O(1/log n) leakage rate. Unfortunately, their security proof has a flaw that we exhibit in this paper. Moreover, their scheme is restricted to base fields containing the nth roots of unity, which notably excludes fields of characteristic 2, although these are essential in some cryptographic primitives (such as the AES block cipher).
In [8], Barthe, Belaïd, Dupressoir, Fouque, Grégoire, Strub and Zucchini formalized the notion of composable gadgets, which notably makes it possible to prove region probing security. More precisely, they introduced the notion of Strong Non-Interference (SNI), which refines the notion of probing security by distinguishing between external and internal probes in the circuits. SNI security allows composing masked gadgets since the notion implies that gadgets stop the propagation of dependencies. However, SNI gadgets are usually less efficient than classical probing-secure gadgets and require more randomness. Another approach consists in carefully composing SNI gadgets and NI gadgets (a relaxation of the SNI notion) to achieve security with better performance (see [14] and references therein). In [19], Cassiers and Standaert introduced the notion of Probe Isolating Non-Interference (PINI), which allows secure composition and efficient implementations. It relies on the position (share index) of the probes in a target implementation. Thanks to this notion, linear functions are directly composable without refreshing, and non-linear operations remain efficient. A circuit achieves PINI security (and is consequently probing secure) if all its gadgets are PINI, but the notion is not sufficient to achieve region probing security.
In this paper, we introduce a new composition framework to construct circuits (or masked implementations) satisfying the region probing security notion. For this purpose, we formalize the property of input-output separation (IOS) for a refresh gadget, and we show that it allows one to easily construct region-probing-secure circuits from (weaker) probing-secure gadgets, in particular from more efficient gadgets which are only proven probing secure but not SNI, e.g. [29,11]. We show that this notion can be obtained from uniform SNI or PINI refresh gadgets, but also with a simpler design, namely a variant of the refreshing algorithm due to Battistello, Coron, Prouff and Zeitoun [10]. While the original refresh gadget from [10] was proven SNI, for our purposes we simplify and extend it, and show that it achieves our new IOS security notion. The proposed variant can be used to refresh any kind of linear sharing with a quasilinear complexity Θ(n log n) and an O(1/log n) leakage rate (for an n-size sharing).
We then revisit the quasilinear masking scheme of Goudarzi, Joux and Rivain [29] (which we shall call the GJR scheme hereafter). This scheme is based on a polynomial sharing of the form $a = \sum_i a_i \omega^i$, where a is the plain variable and the $a_i$'s are the corresponding shares, and it uses an FFT-based polynomial multiplication to achieve a quasilinear complexity. We describe an improved version of the GJR scheme which works on any base field, including binary fields, and which relies on our composition framework. We further patch a flaw in the original security proof and extend it from the random probing model to the stronger region probing model. Specifically, our improved GJR scheme is secure in the region probing model provided that the underlying FFT algorithm is probing secure. On this ground, we obtain a probing-secure FFT using the approach of [29], that is, by relying on a large field with $|K| = \Theta(2^\lambda)$ and taking ω at random. We hence get a region-probing-secure scheme for large fields. For smaller fields, our result is essentially a security reduction from the region probing security of the full scheme to the probing security of the FFT. Finally, we present an application of our extended GJR scheme and compare it with a more standard scheme based on SNI gadgets for two different ciphers: the Advanced Encryption Standard (AES) [1] and MiMC [4], a cipher with an efficient arithmetic representation over a large field. We show that this masking scheme significantly improves the efficiency of the masked cipher for masking orders n ≥ 64 for MiMC and n ≥ 512 for the AES. For the AES instantiation, we present a variant of the Gao-Mateer additive FFT [27] with improved efficiency, which may be of independent interest.

Notations
In this paper, K shall denote a finite field. Vectors shall be denoted with bold letters, e.g. $\mathbf{x}$. For any two random variables (or random vectors) X and Y, we shall write $X \overset{id}{=} Y$ whenever X and Y are identically distributed. For some positive integer n ∈ ℕ, we denote by [n] the set {1, 2, …, n}. For any two vectors $\mathbf{u}, \mathbf{v} \in K^n$, $\langle \mathbf{u}, \mathbf{v} \rangle$ denotes their inner product. For any finite set I, we denote by |I| the cardinality of I. For I ⊆ [n] and $\mathbf{v} = (v_1, \dots, v_n) \in K^n$, we denote by $\mathbf{v}|_I$ the |I|-tuple $(v_i)_{i \in I}$. We denote by $x \leftarrow \mathcal{X}$ the action of picking x uniformly at random in some set $\mathcal{X}$, and by $y \leftarrow A(x)$ the action of defining y as the output of an algorithm A on input x. If A is a probabilistic algorithm, then $y \leftarrow A(x)$ is a random assignment of y on input x and a uniform random tape.

Basic Definitions
Arithmetic circuits. Given a finite field K, an arithmetic circuit is a circuit processing elements of K through simple arithmetic operations. Formally, it is modeled as a directed acyclic graph whose vertices are gates of the following types:
- input gate (fan-in 0, fan-out 1), which holds an input value of the circuit;
- output gate (fan-in 1, fan-out 0), which receives an output value of the circuit;
- constant gate (fan-in 0, fan-out 1), which outputs a constant value of K;
- addition gate (fan-in 2, fan-out 1), which outputs the sum (over K) of its two input values;
- subtraction gate (fan-in 2, fan-out 1), which outputs the difference (over K) of its two input values;
- multiplication gate (fan-in 2, fan-out 1), which outputs the product (over K) of its two input values;
- copy gate (fan-in 1, fan-out 2), which outputs two copies of its input.
The addition, subtraction and multiplication gates are further called operation gates. The edges of an arithmetic circuit are called the wires. A randomized arithmetic circuit is an arithmetic circuit augmented with random gates (fan-in 0, fan-out 1), each of which outputs a fresh uniform random value of K. Given some assignment of the input gates, all the wires of a circuit can be assigned in turn following the input-output behavior of the gates, which finally leads to an assignment of the output gates. For an arithmetic circuit C with n input gates and m output gates, we denote by $y = C(x) \in K^m$ the output of C (i.e. the assignment of the output gates of C) on input $x \in K^n$ (i.e. when the input gates are assigned to x). For a randomized arithmetic circuit C with q random gates, we denote by $y = C^\rho(x)$ the output of C on input x when each random gate outputs a coordinate of $\rho \in K^q$. The parameter ρ is then called the random tape of C. Whenever ρ is omitted, y = C(x) denotes the random vector obtained for a uniform distribution of ρ.
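A minimal Python sketch of these definitions, assuming K = F_7 (a hypothetical choice) and a hypothetical `Gate` record. Gates are listed in topological order, each output wire is labeled by its position in the wire list (0-based, whereas the text labels wires from 1), output gates are left implicit, and fan-out constraints are not enforced. The function `probe` returns the tuple $C^\rho_W(x)$ defined in the next paragraph.

```python
from dataclasses import dataclass

P = 7  # hypothetical base field K = F_7


@dataclass(frozen=True)
class Gate:
    kind: str          # "input", "const", "random", "add", "sub", "mul" or "copy"
    args: tuple = ()   # labels of the input wires (operation and copy gates)
    value: int = 0     # index into x (input), index into rho (random) or the constant


def evaluate(gates, x, rho):
    """Assign every wire of a randomized arithmetic circuit on input x and tape rho.

    Each gate appends its output wire(s) to `wires`, so a wire's label is its
    0-based position in that list.
    """
    wires = []
    for g in gates:
        if g.kind == "input":
            wires.append(x[g.value] % P)
        elif g.kind == "const":
            wires.append(g.value % P)
        elif g.kind == "random":
            wires.append(rho[g.value] % P)
        elif g.kind == "add":
            wires.append((wires[g.args[0]] + wires[g.args[1]]) % P)
        elif g.kind == "sub":
            wires.append((wires[g.args[0]] - wires[g.args[1]]) % P)
        elif g.kind == "mul":
            wires.append((wires[g.args[0]] * wires[g.args[1]]) % P)
        elif g.kind == "copy":  # fan-out 2: two wires carrying the same value
            wires.extend([wires[g.args[0]]] * 2)
        else:
            raise ValueError(f"unknown gate kind {g.kind!r}")
    return wires


def probe(gates, x, rho, W):
    """C^rho_W(x): values of the wires whose labels are in W, in increasing label order."""
    wires = evaluate(gates, x, rho)
    return tuple(wires[w] for w in sorted(W))


# x_0 masked by a random value: the wires are (x_0, r, x_0 + r)
masked = [Gate("input", value=0), Gate("random", value=0), Gate("add", (0, 1))]
print(probe(masked, [3], [5], {2}))  # (1,) since 3 + 5 = 8 = 1 mod 7
```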
Let C be a randomized arithmetic circuit with n input gates, m output gates and q random gates, and let us consider that the wires of C are labeled from 1 to s (where s is the total number of wires in C). Then for any set W ⊆ [s] with |W| = t, we denote by $C^\rho_W(x) \in K^t$ the tuple composed of the assignments of the wires with labels in W on input $x \in K^n$ and random tape $\rho \in K^q$. In particular, each coordinate of $C^\rho_W(x)$ is a deterministic function of x and ρ. Here again, whenever ρ is omitted, $C_W(x)$ denotes the random vector obtained for a uniform distribution of ρ over $K^q$.

Circuit compilers.
We now recall the definition of circuit compilers as formalized in [5] (but adapted to arithmetic circuits). We shall call a K-string any tuple of elements from the base field K.
Definition 1 (Circuit Compiler). A circuit compiler is a triplet of algorithms (Compile, Encode, Decode) defined as follows:
• Compile (circuit compilation) is a deterministic algorithm that takes as input an arithmetic circuit C and outputs a randomized arithmetic circuit $\widehat{C}$.
• Encode (input encoding) is a probabilistic algorithm that takes as input a K-string x and outputs a K-string $\hat{x}$.
• Decode (output decoding) is a deterministic algorithm that takes as input a K-string $\hat{y}$ and outputs a K-string y.
These three algorithms satisfy the following properties:
• Correctness: For every arithmetic circuit C of input length ℓ, and for every $x \in K^\ell$, we have
$$\Pr\big[\mathrm{Decode}(\widehat{C}(\mathrm{Encode}(x))) = C(x)\big] = 1,$$
where $\widehat{C} = \mathrm{Compile}(C)$.
• Efficiency: For some parameter n ∈ ℕ called the encoding order, the running time of Compile(C) is poly(n, |C|), the running time of Encode(x) is poly(n, |x|) and the running time of Decode($\hat{y}$) is poly(n, |$\hat{y}$|), where $\mathrm{poly}(n, q) = O(n^{k_1} q^{k_2})$ for some constants $k_1, k_2$.
Sharings and gadgets. Let n ∈ ℕ and let $\mathbf{v} \in (K^*)^n$. A v-linear sharing of x ∈ K is a vector $\mathbf{x} \in K^n$ such that $\langle \mathbf{v}, \mathbf{x} \rangle = x$. The coordinates of a linear sharing $\mathbf{x} \in K^n$ are called the shares of x. A random vector $\mathbf{x}$ is a uniform v-linear sharing of x if $\langle \mathbf{v}, \mathbf{x} \rangle = x$ and $\mathbf{x}|_I$ is uniformly distributed over $K^{|I|}$ for any I ⊂ [n] with |I| < n.
Let v-Enc denote a probabilistic algorithm that, on input x, outputs a uniform v-linear sharing of x. For instance, v-Enc(x) may perform the following:
$$x_1, \dots, x_{n-1} \leftarrow K, \qquad x_n \leftarrow v_n^{-1}\Big(x - \sum_{i=1}^{n-1} v_i x_i\Big),$$
and return the vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$. We further denote by v-Dec the deterministic algorithm that, on input a v-linear sharing of x, outputs x. This algorithm simply computes the inner product v-Dec($\mathbf{x}$) = $\langle \mathbf{v}, \mathbf{x} \rangle$.
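A minimal Python sketch of one possible v-Enc/v-Dec pair, assuming K is a prime field F_P with the hypothetical modulus P = 2^31 − 1: the first n − 1 shares are drawn uniformly and the last one is solved from $\langle \mathbf{v}, \mathbf{x} \rangle = x$, which requires $v_n \neq 0$. The usage below instantiates $\mathbf{v} = (1, \omega, \omega^2, \dots)$ to obtain a polynomial sharing $a = \sum_i a_i \omega^i$ in the style of the GJR scheme; ω = 12345 is an arbitrary illustrative value.

```python
import secrets

P = 2**31 - 1  # hypothetical prime modulus: K = F_P


def v_enc(x: int, v: list[int]) -> list[int]:
    """Uniform v-linear sharing of x over F_P."""
    if any(vi % P == 0 for vi in v):
        raise ValueError("v must have nonzero coordinates")
    shares = [secrets.randbelow(P) for _ in range(len(v) - 1)]  # x_1, ..., x_{n-1} <- K
    partial = sum(vi * xi for vi, xi in zip(v, shares)) % P     # sum_{i<n} v_i x_i (zip stops early)
    last = (x - partial) * pow(v[-1], -1, P) % P                # x_n = v_n^{-1} (x - partial)
    return shares + [last]


def v_dec(shares: list[int], v: list[int]) -> int:
    """Inner product <v, x> over F_P."""
    return sum(vi * xi for vi, xi in zip(v, shares)) % P


omega = 12345                                  # illustrative, nonzero
v_poly = [pow(omega, i, P) for i in range(8)]  # v = (1, omega, ..., omega^7)
a_shares = v_enc(42, v_poly)
print(v_dec(a_shares, v_poly))                 # 42
```

Any n − 1 shares are uniform: if the last share is among them, some omitted share $x_j$ with $v_j \neq 0$ is uniform and independent, which makes $x_n$ uniform as well.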
For any operation g : (x, y) ∈ K² ↦ z ∈ K and for any vector $\mathbf{v} \in (K^*)^n$, a v-gadget of g is a randomized arithmetic circuit with 2n input gates and n output gates which, on input a v-linear sharing of x and a v-linear sharing of y, outputs a v-linear sharing of z = g(x, y), for any x, y ∈ K. In particular, G is a v-gadget of g if and only if, for every random tape ρ and every v-linear sharings $\mathbf{x}$ and $\mathbf{y}$ of x and y, we have v-Dec($G^\rho(\mathbf{x}, \mathbf{y})$) = g(x, y). A v-refresh gadget is a randomized arithmetic circuit with n input gates and n output gates which, on input a v-linear sharing of x, outputs a v-linear sharing of x, for any x ∈ K.

Standard circuit compilers. Consider a family of vectors $V = \{\mathbf{v}_n \in K^n\}_{n \in \mathbb{N}}$ and three families of gadgets $G^\oplus = \{G^\oplus_n\}_{n \in \mathbb{N}}$, $G^\otimes = \{G^\otimes_n\}_{n \in \mathbb{N}}$ and $G^R = \{G^R_n\}_{n \in \mathbb{N}}$ such that, for every n ∈ ℕ, $G^\oplus_n$ is a $\mathbf{v}_n$-gadget for the addition on K, $G^\otimes_n$ is a $\mathbf{v}_n$-gadget for the multiplication on K, and $G^R_n$ is a $\mathbf{v}_n$-refresh gadget. The standard circuit compiler for $(V, G^\oplus, G^\otimes, G^R)$ with encoding order n is the circuit compiler for which:
• Encode applies $\mathbf{v}_n$-Enc to each coordinate of the input K-string;
• Decode applies $\mathbf{v}_n$-Dec to each coordinate of the input K-string;
• Compile takes an arithmetic circuit C and outputs the randomized arithmetic circuit $\widehat{C}$ in which each addition gate is replaced by an addition gadget $G^\oplus_n$ followed by a refresh gadget $G^R_n$, each multiplication gate is replaced by a multiplication gadget $G^\otimes_n$ followed by a refresh gadget $G^R_n$, each constant gate outputting α is replaced by n constant gates with constants $(\alpha \cdot v_1^{-1}, 0, \dots, 0)$ followed by a refresh gadget $G^R_n$, and each copy gate is replaced by a copy of the input sharing (through n copy gates) followed by one refresh gadget $G^R_n$ per output sharing.
It is not hard to see that such a circuit compiler achieves correctness and efficiency, provided, for the latter, that the sizes of the gadgets $G^\oplus_n$, $G^\otimes_n$ and $G^R_n$ are polynomial in n. To ease the presentation, we restrict the notion of standard circuit compiler to three types of gadgets (addition, multiplication and refresh), but in practice we consider compilers for which the addition gadget is replaced by a broader class of sharewise gadgets. These gadgets apply a linear operation (addition, subtraction, multiplication by a constant, or any $K_0$-linear operation if K is a $K_0$-module) share by share to the input linear sharing(s).
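The compilation rules above can be sketched in Python for linear circuits. This is a hypothetical interpreter, not the compiler of this paper: it evaluates each compiled gadget on sharings instead of emitting a circuit, assumes K = F_P with P = 2^31 − 1, uses a simple refresh that adds a fresh uniform sharing of 0 (chosen only to illustrate correctness; no security claim is made for it), and omits multiplication and explicit copy gates (fan-out is modeled by reusing a gate index). It checks the correctness property: decoding the last sharing gives the plain result.

```python
import secrets

P = 2**31 - 1  # hypothetical prime field K = F_P


def v_enc(x, v):
    """Uniform v-linear sharing of x (first n-1 shares uniform, last one solved for)."""
    shares = [secrets.randbelow(P) for _ in range(len(v) - 1)]
    partial = sum(vi * si for vi, si in zip(v, shares)) % P
    return shares + [(x - partial) * pow(v[-1], -1, P) % P]


def v_dec(shares, v):
    return sum(vi * si for vi, si in zip(v, shares)) % P


def refresh(shares, v):
    """Illustrative refresh: add a fresh uniform v-sharing of 0."""
    return [(s + z) % P for s, z in zip(shares, v_enc(0, v))]


def compile_and_run(circuit, inputs, v):
    """Run the compiled version of a linear circuit and decode its last sharing.

    circuit: gates in topological order, each ("input", i), ("const", alpha)
    or ("add", j, k), where j and k are indices of earlier gates.
    """
    inv_v1 = pow(v[0], -1, P)
    encoded = [v_enc(x, v) for x in inputs]  # Encode
    sharings = []
    for gate in circuit:
        kind = gate[0]
        if kind == "input":
            s = encoded[gate[1]]
        elif kind == "const":                # (alpha / v_1, 0, ..., 0), then refresh
            s = refresh([gate[1] * inv_v1 % P] + [0] * (len(v) - 1), v)
        elif kind == "add":                  # sharewise addition gadget, then refresh
            a, b = sharings[gate[1]], sharings[gate[2]]
            s = refresh([(x + y) % P for x, y in zip(a, b)], v)
        else:
            raise ValueError(f"unsupported gate {gate!r}")
        sharings.append(s)
    return v_dec(sharings[-1], v)            # Decode


v = [3, 7, 11, 13]
circuit = [("input", 0), ("input", 1), ("const", 5), ("add", 0, 1), ("add", 3, 2)]
print(compile_and_run(circuit, [10, 20], v))  # 35 = 10 + 20 + 5
```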

Probing Security
Throughout the paper, the notion of simulator refers to a polynomial-time probabilistic algorithm. We say that a random vector w can be perfectly simulated (possibly given some input in) if there exists a simulator S that (given in) outputs a vector which is identically distributed to w (over the internal randomness of the simulator), which shall be denoted $S(\mathsf{in}) \overset{id}{=} w$. Informally speaking, a randomized arithmetic circuit achieves t-probing security if leaking the values of t arbitrary wires (i.e. allowing t probes on the circuit) does not reveal any information about the input (provided that the latter has been properly encoded). This is formally defined hereafter.
Definition 2 (Probing Security). A randomized arithmetic circuit $\widehat{C}$ is t-probing secure w.r.t. an encoding algorithm Encode if for every plain input x and for every set $W \subseteq [|\widehat{C}|]$ with |W| ≤ t, there exists a simulator $S_{\widehat{C},W}$ such that
$$S_{\widehat{C},W}() \overset{id}{=} \widehat{C}_W(\mathrm{Encode}(x)).$$
A circuit compiler (Compile, Encode, Decode) is said to achieve t-probing security if for every arithmetic circuit C, the randomized arithmetic circuit $\widehat{C}$ = Compile(C) is t-probing secure w.r.t. Encode. Note that, in practice, the parameter t is a function of the encoding order n. For instance, the first probing-secure scheme, due to Ishai, Sahai and Wagner, achieves t-probing security with t ≤ (n − 1)/2 and an efficiency $|\widehat{C}| = \Theta(n^2 |C|)$.
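For small parameters, the simulation requirement can be checked exhaustively. The sketch below assumes a toy field F_5 and a sharing vector v = (1, 2, 3) (hypothetical choices) and only considers the encoding wires, i.e. the shares output by a v-Enc that draws the first n − 1 shares uniformly and solves for the last one. A set of probed shares can be simulated without x exactly when its distribution is the same for every x; this holds for every set of at most n − 1 = 2 shares, while the full set of 3 shares reveals x.

```python
from collections import Counter
from itertools import combinations, product

P = 5          # toy field F_5 (hypothetical choice, small enough to enumerate)
V = (1, 2, 3)  # hypothetical sharing vector with nonzero coordinates


def v_enc_with_tape(x, tape):
    """Deterministic v-Enc: tape = (x_1, ..., x_{n-1}); the last share solves <V, shares> = x."""
    partial = sum(vi * xi for vi, xi in zip(V, tape)) % P
    last = (x - partial) * pow(V[-1], -1, P) % P
    return tape + (last,)


def probe_distribution(x, W):
    """Exact distribution of the probed shares W over all random tapes."""
    return Counter(
        tuple(v_enc_with_tape(x, tape)[i] for i in W)
        for tape in product(range(P), repeat=len(V) - 1)
    )


def leaks(W):
    """True iff the distribution of the probed shares depends on the secret x."""
    reference = probe_distribution(0, W)
    return any(probe_distribution(x, W) != reference for x in range(1, P))


for t in range(1, len(V) + 1):
    for W in combinations(range(len(V)), t):
        print(t, W, "leaks" if leaks(W) else "simulatable")
```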
Most probing-secure circuit compilers are based on the composition of gadgets. These gadgets are themselves probing secure w.r.t. the underlying encoding scheme, but they must also satisfy composition properties so that the overall compiled circuit is probing secure. In particular, the notions of (strong) non-interference, or (S)NI, and probe isolating non-interference, or PINI, have been proposed and studied in [8,14,7,19]. In this paper, we introduce another notion called input-output separation (see Section 3), which aims to enable composition for a stronger notion of probing security, namely region probing security. In a nutshell, a circuit is region probing secure if it is composed of several sub-circuits (e.g. several gadgets) that can each tolerate some number of probes (irrespective of the total number of sub-circuits). We then define the probing rate (or leakage rate) of such a circuit as the maximum ratio between the number of tolerable probes and the size of a sub-circuit. Region probing security is formalized hereafter.
Let us first introduce the notion of circuit partition. For any (randomized) arithmetic circuit C, we call $C \equiv (C_1, C_2, \dots, C_m)$ a circuit partition, where each $C_i$ is a sub-circuit of C such that the gates of the $C_i$'s form a partition of the gates of C. We further denote by $W_{C_i}$ the set of wires with source gate in $C_i$, so that $W_{C_1}, \dots, W_{C_m}$ is a partition of [|C|].

Definition 3 (Region Probing Security).
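A randomized circuit $\widehat{C}$ is r-region probing secure (i.e. with probing rate r) w.r.t. an encoding algorithm Encode if there exists a circuit partition $\widehat{C} \equiv (C_1, C_2, \dots, C_m)$ such that for every plain input x and for every set $W \subseteq [|\widehat{C}|]$ with $|W \cap W_{C_i}| \le r \cdot |C_i|$ for every i ∈ [m], there exists a simulator $S_{\widehat{C},W}$ such that
$$S_{\widehat{C},W}() \overset{id}{=} \widehat{C}_W(\mathrm{Encode}(x)).$$
A circuit compiler (Compile, Encode, Decode) is r-region probing secure if for every circuit C the compiled circuit $\widehat{C}$ = Compile(C) is r-region probing secure w.r.t. Encode (where r might be a function of the encoding order and the circuit size).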
We shall further say that a circuit $\widehat{C}$ is (r, ε)-region probing secure (i.e. with probing rate r and simulation failure ε), if the simulator fails (i.e. returns ⊥) with probability (m being the number of regions) and returns a perfect simulation otherwise: Region probing security is a relevant security property for a cryptographic implementation when considering side-channel attacks. Indeed, security in the so-called noisy leakage model, which captures the physical reality of power and electromagnetic side-channel leakages, can be reduced to region probing security. These notions and reductions are recalled in Appendix B.
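The per-region probe budget can be checked mechanically. The helper below is hypothetical and assumes the region condition takes the form $|W \cap W_{C_i}| \le r \cdot |C_i|$ for every region, measuring the size of each region by its number of wires (an assumption of this sketch).

```python
def respects_region_budget(W: set[int], regions: list[set[int]], rate: float) -> bool:
    """Check |W ∩ W_{C_i}| <= rate * |W_{C_i}| for every region (hypothetical helper).

    W: labels of the probed wires; regions: the wire sets W_{C_1}, ..., W_{C_m}
    of a circuit partition (assumed pairwise disjoint and covering all wires).
    """
    covered = set().union(*regions)
    if not W <= covered:
        raise ValueError("probed wires must belong to the partitioned circuit")
    return all(len(W & Wi) <= rate * len(Wi) for Wi in regions)


regions = [{0, 1, 2, 3}, {4, 5, 6, 7}]
print(respects_region_budget({0, 1, 4, 5}, regions, 0.5))  # True: 2 probes in each region of 4 wires
print(respects_region_budget({0, 1, 2}, regions, 0.5))     # False: 3 probes in the first region
```

With m regions of s wires each, such a budget allows up to r·s·m probes in total, a number that grows linearly with the circuit, which is the point of the notion.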

Input-Output Separation
We introduce hereafter the input-output separation security notion for a refresh gadget. Such a property was originally used in the GJR scheme to achieve composition in the random probing model [29]. We formalize this notion hereafter as a general composition property to achieve region probing security. For the sake of simplicity, the definition given in this section only considers refresh gadgets, but it can be generalized to any kind of gadget (see Appendix A for a general definition).
We first introduce the notion of uniformity for a gadget which will be a requirement for our new security notion.
In the following, we say that a pair of vectors $(\mathbf{x}, \mathbf{y}) \in (K^n)^2$ is admissible for a gadget G if there exists a random tape ρ such that $\mathbf{y} = G^\rho(\mathbf{x})$. For an admissible pair $(\mathbf{x}, \mathbf{y})$ and a set W ⊆ [|G|], the wire distribution of G in W induced by $(\mathbf{x}, \mathbf{y})$, denoted $G_W(\mathbf{x}, \mathbf{y})$, is the random vector $G^\rho_W(\mathbf{x})$, i.e. the tuple of wire values for the wire indices in W, obtained for a uniform drawing of ρ among the set $\{\rho \in K^q : G^\rho(\mathbf{x}) = \mathbf{y}\}$.
Definition 5 (IOS). Let $\mathbf{v} \in (K^*)^n$ and let G be a v-refresh gadget with s wires. G is said to be t-input-output separative (t-IOS) if it is uniform and if for every admissible pair $(\mathbf{x}, \mathbf{y})$ and every set of wires W ⊆ [s] with |W| ≤ t, there exists a (two-stage) simulator $S_{G,W} = (S^1_{G,W}, S^2_{G,W})$ such that $S^1_{G,W}$ outputs two sets I, J ⊆ [n] with |I| ≤ |W| and |J| ≤ |W|, and
$$S^2_{G,W}(\mathbf{x}|_I, \mathbf{y}|_J) \overset{id}{=} G_W(\mathbf{x}, \mathbf{y}).$$
A v-refresh gadget is simply said to be IOS if it is n-IOS.
The above definition generalizes the notion of input-output linear separability used in the GJR scheme [29]. Our definition differs from the GJR notion in two ways:
• the GJR notion requires a deterministic (functional) relation between the probed wires and the input/output shares, whereas we only require the ability to simulate the probed wires from some input/output shares;
• the GJR notion requires the knowledge of arbitrary linear combinations of the input/output shares, whereas we require the knowledge of some input/output shares.
The first difference makes our definition easier to achieve without impacting composability. Indeed, in any probing security context, the ability to achieve a perfect simulation is sufficient to prove security. The second difference makes our definition harder to achieve but more useful in different composition contexts (where probing security might not rely on linear algebra). Moreover, we describe in Section 4 a refresh gadget achieving our version of input-output separation.
The intuition behind the IOS notion can be understood as follows. Any probing leakage from an IOS refresh gadget can be simulated given a subset of its input shares and output shares. We can therefore reduce the standard region probing security game to a game in which the refresh gadget does not leak anything but its surrounding gadgets leak more. The uniformity property then implies that the leakages from two gadgets separated by a refresh gadget are mutually independent. One can then achieve a perfect simulation of the full leakage through independent simulations of the separated leakages from the two gadgets.
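A toy illustration, under stated assumptions: the two-share refresh $\mathbf{y} = (x_1 + r, x_2 - r)$ for v = (1, 1) over F_5 is a hypothetical example, not the refresh gadget of Section 4. Conditioning on an admissible pair $(\mathbf{x}, \mathbf{y})$ fixes $r = y_1 - x_1$, so a probe on the internal wire r is perfectly simulated from the single input share $x_1$ and the single output share $y_1$ (I = J = {1}), in the spirit of the IOS notion.

```python
from collections import Counter

P = 5  # toy field K = F_5 (hypothetical choice, small enough to enumerate)


def toy_refresh(x, r):
    """Hypothetical 2-share refresh for v = (1, 1): y = (x_1 + r, x_2 - r).

    Returns the output sharing together with the internal wire values.
    """
    y = ((x[0] + r) % P, (x[1] - r) % P)
    return y, {"r": r}


def conditional_wire_distribution(x, y, wire):
    """G_W(x, y) for W = {wire}: distribution over the tapes r with G^r(x) = y."""
    consistent = [r for r in range(P) if toy_refresh(x, r)[0] == y]
    return Counter(toy_refresh(x, r)[1][wire] for r in consistent)


x, y = (2, 4), (0, 1)  # admissible: r = 3 gives (2 + 3, 4 - 3) = (0, 1) mod 5
print(conditional_wire_distribution(x, y, "r"))  # Counter({3: 1})
print((y[0] - x[0]) % P)                         # 3, i.e. r = y_1 - x_1
```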

[Figure 1: panels "Plain world" and "Encoded world".]
This is illustrated in Figure 1. The full probing leakage $(w_1, w_R, w_2)$ can be simulated from $(w_1, \mathbf{y}|_I, \mathbf{y}|_J, w_2)$. Moreover, given x, the uniformity of the refresh implies that the separated leakages $(w_1, \mathbf{y}|_I)$ and $(w_2, \mathbf{y}|_J)$ are mutually independent. Therefore, if one can simulate $(w_1, \mathbf{y}|_I)$ on the one hand and $(w_2, \mathbf{y}|_J)$ on the other hand, then one can simulate the full leakage.

Composition Theorem
We now provide a formal proof of composition based on the IOS property defined above. Specifically, we show that a standard circuit compiler interleaving operation gadgets and refresh gadgets is region probing secure provided that its operation gadgets are probing secure, and its refresh gadgets are IOS.
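As introduced in Section 2, we consider hereafter a family of vectors $V = \{\mathbf{v}_n \in K^n\}_{n \in \mathbb{N}}$ and three families of gadgets $G^\oplus = \{G^\oplus_n\}_{n \in \mathbb{N}}$, $G^\otimes = \{G^\otimes_n\}_{n \in \mathbb{N}}$ and $G^R = \{G^R_n\}_{n \in \mathbb{N}}$ such that, for every n ∈ ℕ, $G^\oplus_n$ is a $\mathbf{v}_n$-gadget for the addition on K, $G^\otimes_n$ is a $\mathbf{v}_n$-gadget for the multiplication on K, and $G^R_n$ is a $\mathbf{v}_n$-refresh gadget. The following theorem gives our composition result for the standard circuit compiler for $(V, G^\oplus, G^\otimes, G^R)$.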
Proof. Let n ∈ ℕ and let $t \le t^R_n$. Let C be an arithmetic circuit composed of m operation gates, and let $\widehat{C}$ be the randomized arithmetic circuit obtained by calling the standard circuit compiler for $(V, G^\oplus, G^\otimes, G^R)$ on C. We denote by $G_1, G_2, \dots, G_m$ the operation gadgets of $\widehat{C}$ and by $G^R_1, G^R_2, \dots, G^R_m$ the refresh gadgets of $\widehat{C}$, where $G^R_i$ is placed at the output of $G_i$ for every i. We further denote by $W_{G_i}$ and $W_{G^R_i}$ the sets of wires with source gate in $G_i$ and $G^R_i$ respectively. Finally, we denote by $t_i$ the integer such that . We will show that for any input in of $\widehat{C}$, there exists a simulator $S_{\widehat{C},W}$ such that which directly implies the $r_n$-region probing security of the standard circuit compiler with The above shall hold for every $t \le t^R_n$, which yields the maximum in (1). The simulator $S_{\widehat{C},W}$ is simply obtained by running the simulators inherited from the probing security of the $G_i$'s and the IOS property of the $G^R_i$'s. Specifically: , with |I| ≤ |W| and |J| ≤ |W|; given that the pair of input/output sharings of $G^R_i$ equals $(\mathbf{x}, \mathbf{y})$.
Here $\mathbf{x}|_I$ corresponds to |I| ≤ t (output) wires of the gadget $G_i$ and $\mathbf{y}|_J$ corresponds to |J| ≤ t (input) wires of the gadget $G_j$ subsequent to the refresh $G^R_i$. In particular, there exist two sets $I_i \subseteq W_{G_i}$ and $J_i \subseteq W_{G_j}$ such that $\mathbf{x}|_I = \widehat{C}_{I_i}(\mathrm{Encode}(\mathsf{in}))$ and $\mathbf{y}|_J = \widehat{C}_{J_i}(\mathrm{Encode}(\mathsf{in}))$.
• Let φ and ψ be the index-mapping functions such that the two input sharings of gadget $G_i$ are the output sharings of the refresh gadgets $G^R_{\varphi(i)}$ and $G^R_{\psi(i)}$. By defining
• The probing security of the $G_i$'s implies that, for every i ∈ [m], there exists a simulator $S_{G_i,W_i}$ such that We now have all the ingredients to describe the simulator $S_{\widehat{C},W}$. It proceeds as follows: to get the sets $I_i$'s and $J_i$'s.
2. $S_{\widehat{C},W}$ then calls the simulators $S_{G_i,W_i}(\bot)$ to get tuples By the uniformity property of the refresh gadgets, the distributions $\widehat{C}_{W_{G_i}}(\mathrm{Encode}(\mathsf{in}))$, given in, are mutually independent, which implies 3. $S_{\widehat{C},W}$ finally calls the simulators S which concludes the proof.

Comparison with Non-Interference Security Notions
It is well known that the composition of probing-secure gadgets is not always probing secure [23]. Stronger security definitions were previously proposed to analyse the security of large circuits viewed as the composition of simple gadgets. The first such notion, (strong) non-interference, or (S)NI, was proposed in [8]. The notion of Probe Isolating Non-Interference (PINI) was more recently introduced in [19]. In this section, we compare our composition approach with the ones underlying the (S)NI and PINI notions, and then show some implications between these notions and our IOS notion. We first recall the (S)NI and PINI definitions, extending them from standard Boolean sharing to the general case of v-sharings. For a v-refresh gadget, the (S)NI notion is defined as follows:
Definition 6 (NI and SNI). Let $\mathbf{v} \in (K^*)^n$ and let G be a v-refresh gadget with s wires. G is said to be t-Non-Interferent (t-NI) (resp. t-Strong Non-Interferent (t-SNI)), if for every $\mathbf{x}$ and every set of internal wires W ⊆ [s] with $|W| \le t_1$ and every set of A v-refresh gadget is simply said to be NI (resp. SNI) if it is (n−1)-NI (resp. (n−1)-SNI).
If a gadget achieves NI-security, then a probe of an internal wire or an output wire can be simulated using one probe on each of the input sharings of the gadget. If it achieves the stronger SNI-security notion then only probes of internal wires are propagated to inputs (and it thus guarantees independence between the inputs and outputs even with access to the internal wires).
For a v-refresh gadget, the PINI notion is defined as follows: […] $S_{G,W,O}$ such that […]

Comparison of the composition approaches. We discuss hereafter the composition approaches related to the (S)NI notion, the PINI notion and our new IOS notion.
(S)NI composition approach. The NI and SNI notions were proposed in [8] as composition notions for probing-secure gadgets. The authors show how to compose t-NI and t-SNI gadgets to achieve t-probing security, which was further generalized in [14]. These results can actually be extended to region probing security. Let us consider the standard circuit compiler as defined in Section 2. If the underlying refresh gadget is SNI and the underlying addition and multiplication gadgets are NI, then it can be checked that the compiled circuit can tolerate up to t/2 probes per gadget. In other words, from an SNI refresh gadget, one simply needs NI operation gadgets to obtain a region probing-secure composition.
PINI composition approach. The PINI notion was introduced to allow trivial composition of probing-secure gadgets [19]. Specifically, composing any number of PINI gadgets in any way results in a circuit achieving PINI security, which further implies probing security. Another advantage of the PINI notion is that it is satisfied by any share-wise gadget (i.e. a gadget which simply applies an operation to each share separately) without requiring any refreshing or randomness. Although the PINI notion enables simpler composition, it is limited to probing security (or PINI security) and cannot be extended to region probing security. To illustrate this impossibility, let us consider the following simple example. Suppose that some circuit compiler applies a single-input share-wise gadget G (for instance squaring on $\mathbb{F}_{256}$) successively many times to an input n-sharing x. Since squaring is a bijection on $\mathbb{F}_{256}$, each probed output share of any of these gadgets reveals the corresponding share of x, so the adversary can target t new share indices in each gadget. After N gadgets each leaking t probes, all the shares can thus be recovered as soon as $N \ge n/t$.
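The squaring example can be replayed on a toy instance. The Python sketch below is only an illustration under our own assumptions: a Boolean n-sharing over $\mathbb{F}_{256}$ represented with the AES reduction polynomial, each squaring gadget modelled as share-wise squaring without any randomness, and an adversary that places t probes on not-yet-probed output shares of each gadget. The helper names `gf_mul`, `square_k`, `boolean_sharing` and `attack` are hypothetical.

```python
import random

AES_POLY = 0x11B  # assumed representation: F_256 = F_2[X]/(X^8 + X^4 + X^3 + X + 1)


def gf_mul(a: int, b: int) -> int:
    """Carry-less multiplication in F_256, reduced modulo AES_POLY."""
    res = 0
    while b:
        if b & 1:
            res ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return res


def square_k(a: int, k: int) -> int:
    """Apply squaring k times; since a^(2^8) = a in F_256, k is taken mod 8."""
    for _ in range(k % 8):
        a = gf_mul(a, a)
    return a


def boolean_sharing(x: int, n: int) -> list[int]:
    """Random Boolean n-sharing of x: x = x_1 ^ x_2 ^ ... ^ x_n."""
    shares = [random.randrange(256) for _ in range(n - 1)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]


def attack(x: int, n: int, t: int) -> tuple[int, int]:
    """Chain share-wise squaring gadgets on a sharing of x, probing t fresh
    output shares per gadget. Returns (rebuilt value, number of gadgets N)."""
    shares = boolean_sharing(x, n)
    recovered: dict[int, int] = {}
    gadgets = 0
    while len(recovered) < n:
        shares = [gf_mul(s, s) for s in shares]  # one share-wise (PINI) gadget
        gadgets += 1
        targets = [i for i in range(n) if i not in recovered][:t]
        for i in targets:
            # shares[i] = x_i^(2^gadgets); squaring is a bijection, so undo it
            recovered[i] = square_k(shares[i], 8 - gadgets % 8)
    value = 0
    for s in recovered.values():
        value ^= s
    return value, gadgets
```

For instance, `attack(0x53, 8, 3)` rebuilds `0x53` after three gadgets, i.e. $N = \lceil n/t \rceil$, although each individual gadget only leaks t < n shares.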

IOS composition approach.
Our composition approach consists in interleaving an IOS refresh gadget between any pair of successive operation gadgets of the compiled circuit (as in the definition of the standard circuit compiler). Doing so, we can lower the requirement on the operation gadgets: they simply need to achieve the weaker notion of probing security (see Theorem 1 above).
Comparison. We compare the three composition approaches for the standard circuit compiler as introduced in Section 2. This compiler basically replaces each gate by the corresponding operation gadget and it interleaves a refresh gadget in each connection between two operation gadgets. Assuming that the refresh gadget satisfies a given notion in {SNI, PINI, IOS}, we look at (i) what is the security notion required for the operation gadgets? (ii) what is the obtained security notion for the composed circuit?
• SNI: (i) The NI notion is sufficient for the operation gadgets.
(ii) The composition of NI operation gadgets and SNI refresh gadgets implies the region probing security of the composed circuit.
• PINI: (i) The PINI notion is sufficient for the operation gadgets.
(ii) The composition of PINI gadgets implies the probing security of the composed circuit. Let us stress that with PINI operation gadgets, PINI refresh gadgets are actually useless.
• IOS: (i) The probing security is sufficient for the operation gadgets.
(ii) The composition of probing-secure operation gadgets and IOS refresh gadgets implies the region probing security of the composed circuit.
Our composition approach hence achieves the stronger notion of region probing security from the weaker notion of probing security for operation gadgets, based on the IOS security of the refresh gadget.

Relations between (S)NI, PINI and IOS.
Besides their differences in terms of composition approach, one might question the relation between usual non-interference notions and IOS. Can we show some form of equivalence, one-way implication, or separation? We leave this issue open for further research.

An Input-Output Separative Refresh Gadget
Battistello, Coron, Prouff and Zeitoun describe in [9] so-called (template) horizontal side-channel attacks against the ISW [31] and the Rivain-Prouff [41] secure multiplication schemes. These attacks exploit the fact that, for those schemes, the leaking information on each share increases with the number of shares in the presence of a constant leakage rate. Battistello et al. describe a variant of the ISW multiplication with probing-security that is heuristically secure against this kind of attacks. In the full version of their paper [10], they further propose a new refreshing gadget with complexity O(n log n), which we shall refer to as the BPCZ gadget hereafter. In this section, we simplify and extend this gadget for any v-linear sharing and we prove that the obtained variant achieves the IOS security notion.

Refresh Gadget Description
Starting from the BPCZ gadget, our approach consists in:
• using a single (post-processing) randomization layer instead of two (pre-processing and post-processing) in the algorithm recursion,
• introducing the necessary multiplications by constants to support v-linear sharings,
• calling the obtained variant of the BPCZ gadget to generate a fresh sharing of 0, which is then used to refresh the input sharing by addition.
The procedure ZeroEncoding, which generates a fresh v-linear sharing of 0, is described in Algorithm 1. It is defined recursively: for n = 2, it outputs $y = (z_1, z_2)$ […]. For $n \ge 4$ a power of 2, ZeroEncoding is called recursively to produce the two halves of the sharing (Steps 4-5) and a post-processing layer is applied to the whole sharing (Steps 6-9). Note that the original refresh gadget proposed in [9] makes use of an additional and similar pre-processing layer before the two recursive calls. As a result, our variant is twice as efficient in terms of computation and randomness generation.
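Since the listing of Algorithm 1 is not reproduced here, the following Python sketch is only our reconstruction over a prime field $\mathbb{F}_p$, not the authors' algorithm. The post-processing layer $z_i = s_i + r_i$, $z_{i+n/2} = s_{i+n/2} - r_i$ is taken from the IOS proof (case $v = (1, \ldots, 1)$); the base case for n = 2 and the placement of the public constant $v_i / v_{i+n/2}$ for a general $v$ are our assumptions, chosen to match the stated counts of one random and one scalar multiplication per pair. The names `zero_encoding`, `refresh_gadget` and `inner` are hypothetical.

```python
import random


def zero_encoding(v: list[int], p: int) -> list[int]:
    """Sketch of a recursive v-linear sharing of 0 over F_p (p prime):
    returns z with sum(v[i] * z[i]) = 0 mod p. len(v) must be a power of 2."""
    n = len(v)
    if n == 2:
        r = random.randrange(p)
        # assumed base case: z_1 = r, z_2 = -(v_1 / v_2) * r (one scalar multiplication)
        return [r, (-v[0] * pow(v[1], -1, p) * r) % p]
    h = n // 2
    # two recursive calls produce the two halves s_|L and s_|H
    s = zero_encoding(v[:h], p) + zero_encoding(v[h:], p)
    z = s[:]
    # single post-processing layer: n/2 fresh randoms, n additions/subtractions
    for i in range(h):
        r = random.randrange(p)
        c = v[i] * pow(v[i + h], -1, p) % p  # public constant v_i / v_{i+n/2}
        z[i] = (s[i] + r) % p
        z[i + h] = (s[i + h] - c * r) % p  # cancels v_i * r in <v, z>
    return z


def refresh_gadget(x: list[int], v: list[int], p: int) -> list[int]:
    """Refresh a v-linear sharing x by adding a fresh v-linear sharing of 0."""
    z = zero_encoding(v, p)
    return [(xi + zi) % p for xi, zi in zip(x, z)]


def inner(v: list[int], x: list[int], p: int) -> int:
    """Encoded value <v, x> mod p."""
    return sum(vi * xi for vi, xi in zip(v, x)) % p
```

With $v = (1, \ldots, 1)$ the constant `c` equals 1 and the layer reduces exactly to the form used in the proof below.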

Algorithm 1 ZeroEncoding
Let us denote by R(n), A(n) and M(n) the randomness complexity, the number of additions and the number of scalar multiplications of the ZeroEncoding algorithm for a length-n linear sharing. We have R(2) = 1, A(2) = 0 and M(2) = 1, and
$R(n) = 2R(n/2) + n/2, \quad A(n) = 2A(n/2) + n, \quad M(n) = 2M(n/2) + n/2$
for all n ≥ 4. By induction, we thus have, for any n ≥ 2 a power of 2,
$R(n) = M(n) = \frac{n}{2}\log n \quad\text{and}\quad A(n) = n\log n - n.$
We have n further additions in RefreshGadget.
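The recurrences can be checked mechanically against the closed forms $R(n) = M(n) = \frac{n}{2}\log n$ and $A(n) = n\log n - n$ (obtained by unrolling them). This short Python check is our own; the function names are hypothetical.

```python
def zero_encoding_costs(n: int) -> tuple[int, int, int]:
    """(R(n), A(n), M(n)) computed directly from the recurrences."""
    if n == 2:
        return 1, 0, 1
    r, a, m = zero_encoding_costs(n // 2)
    return 2 * r + n // 2, 2 * a + n, 2 * m + n // 2


def closed_forms(n: int) -> tuple[int, int, int]:
    """R(n) = M(n) = (n/2) log n and A(n) = n log n - n, for n a power of 2."""
    log_n = n.bit_length() - 1
    return n * log_n // 2, n * log_n - n, n * log_n // 2


for k in range(1, 11):
    n = 2 ** k
    r, a, m = zero_encoding_costs(n)
    assert (r, a, m) == closed_forms(n)
    # RefreshGadget adds n more additions: n log n additions in total
    print(f"n={n:5d}  R={r:6d}  A+n={a + n:6d}  M={m:6d}")
```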
Remark 1. In the original version of this paper, we suggested directly applying the BPCZ variant (without pre-processing layer) to the input sharing x and provided a proof of IOS for this refresh gadget. However, a flaw in this proof was reported to us by Gaëtan Cassiers. This flaw is fixed by using the BPCZ variant to generate a sharing of 0, which is then added to the input sharing x. We note that the obtained refresh gadget (BPCZ without pre-processing layer, generating a sharing of 0) was also considered by Mathieu-Mahias in [34]. The author shows that this gadget achieves the SNI property.

Proof of Input-Output Separation
Theorem 2. The refresh gadget from Algorithm 2 is input-output separative.
Proof. Throughout the proof, we denote $L = [n/2]$ and $H = [n] \setminus L$.
Uniformity. We first show that RefreshGadget is uniform. For this purpose, we simply need to show that ZeroEncoding(v) outputs a uniform v-linear sharing of 0. The proof is by induction on n.
For n = 2, […] where r is picked uniformly at random in $\mathbb{K}$. This is clearly a uniform sharing satisfying $\langle v, z\rangle = 0$.
For $n \ge 4$, denoting $r'_i = s_i + r_i$ $(= z_i)$ for $i \in L$, the vector $(r'_1, \ldots, r'_{n/2})$ is uniformly distributed in $\mathbb{K}^{n/2}$, and we have $z_{i+n/2} =$ […] for $i \in L$, where $s_{|L}$ and $s_{|H}$ are uniform and independent $v_{|L}$-linear and $v_{|H}$-linear sharings of 0. We obtain that z is uniformly distributed among the vectors of $\mathbb{K}^n$ satisfying $\langle v, z\rangle = 0$, namely z is a uniform v-linear sharing of 0.
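For small parameters, the uniformity claim can be checked exhaustively. The Python sketch below uses our own reconstruction of ZeroEncoding (post-processing $z_i = s_i + r_i$, $z_{i+n/2} = s_{i+n/2} - (v_i/v_{i+n/2})\, r_i$, where the constant placement for a general $v$ is our assumption) with randomness read from a tape. It enumerates all $p^{R(n)}$ tapes for n = 4 and p = 5 and checks that each of the $p^{n-1} = 125$ sharings of 0 is output equally often. The names `zero_encoding_det` and `output_histogram` are hypothetical.

```python
from collections import Counter
from itertools import product


def zero_encoding_det(v, p, rand):
    """Deterministic ZeroEncoding sketch: random field elements are read from
    the iterator `rand` instead of being sampled."""
    n = len(v)
    if n == 2:
        r = next(rand)
        return [r, (-v[0] * pow(v[1], -1, p) * r) % p]
    h = n // 2
    s = zero_encoding_det(v[:h], p, rand) + zero_encoding_det(v[h:], p, rand)
    z = s[:]
    for i in range(h):
        r = next(rand)
        z[i] = (s[i] + r) % p
        z[i + h] = (s[i + h] - v[i] * pow(v[i + h], -1, p) * r) % p
    return z


def output_histogram(v, p):
    """Run ZeroEncoding on every possible randomness tape and count outputs."""
    n = len(v)
    n_rand = n * (n.bit_length() - 1) // 2  # R(n) = (n/2) log n random elements
    return Counter(
        tuple(zero_encoding_det(v, p, iter(tape)))
        for tape in product(range(p), repeat=n_rand)
    )


hist = output_histogram([1, 2, 3, 4], 5)
print(len(hist), set(hist.values()))  # prints: 125 {5}
```

Here the 625 tapes map linearly onto the 125 sharings of 0, each with exactly 5 preimages, which is the uniformity property for this toy case.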
IOS. For the sake of simplicity, we show the IOS property for the particular case of v = (1, 1, . . . , 1). This way we can ignore the multiplications by constant factors from the vector coefficients. The argument applies in the exact same way (but with heavier notations) for the general case.
Let $w_1, \ldots, w_m$ denote some random vectors. In the scope of this proof, we shall say that a random vector $x \in \mathbb{K}^\ell$ is $\ell$-free with respect to $w_1, \ldots, w_m$ if any $(\ell-1)$-subtuple of x is uniformly distributed on $\mathbb{K}^{\ell-1}$ and mutually independent of the joint distribution of $(w_1, \ldots, w_m)$. The core of the proof consists in showing the following property of the ZeroEncoding gadget: for every set W of probed wires in ZeroEncoding, with |W| = t, and denoting w the corresponding wire distribution, there exists a set $\mathcal{K} \subseteq [n]$, with $|\mathcal{K}| = n - t$, such that $z_{|\mathcal{K}}$ is $|\mathcal{K}|$-free w.r.t. $w$ and $z_{|[n]\setminus\mathcal{K}}$. (3)
Property (3) is direct for n = 2: if t = 0, it holds from the uniformity of the sharing produced by ZeroEncoding, while for t ≥ 1 we have $|\mathcal{K}| \le 1$, so the property only concerns the empty subtuple and trivially holds. Let us now show this property by induction: we assume that it holds for n/2 and show that it then holds for n.
We denote by $\mathrm{ZE}_1$ the gadget corresponding to the first recursive call to ZeroEncoding (Step 4), by $\mathrm{ZE}_2$ the gadget corresponding to the second recursive call to ZeroEncoding (Step 5) and by M the gadget corresponding to the post-processing layer (Steps 6-9). We denote by $W_1$, $W_2$ and $W_3$ the subsets of W corresponding to wire indexes from $\mathrm{ZE}_1$, $\mathrm{ZE}_2$ and M respectively, so that $W = W_1 \cup W_2 \cup W_3$. Without loss of generality, all the outputs of $\mathrm{ZE}_1$ and $\mathrm{ZE}_2$, which are also inputs of M, are included in $W_1$ and $W_2$, but not in $W_3$. We denote $t_1 = |W_1|$, $t_2 = |W_2|$ and $t_3 = |W_3|$, and consider the case $t_1 + t_2 + t_3 = |W| < n$ (since for $|W| \ge n$ the property is trivial). We denote by $w_1$, $w_2$ and $w_3$ the wire distributions corresponding to $W_1$, $W_2$ and $W_3$, so that $w = (w_1, w_2, w_3)$. As in Algorithm 1, $s = (s_1, \ldots, s_n)$ denotes the linear sharing output by the block $(\mathrm{ZE}_1 \| \mathrm{ZE}_2)$, and $s_{|L} = (s_1, \ldots, s_{n/2})$ and $s_{|H} = (s_{n/2+1}, \ldots, s_n)$ are the respective outputs of $\mathrm{ZE}_1$ and $\mathrm{ZE}_2$. These notations are illustrated in Figure 2.
Without loss of generality, we assume $t_1 \le t_2$. Applying Property (3) to $\mathrm{ZE}_1$ (which holds for n/2 by the induction hypothesis), we obtain that there exists a set $\mathcal{K} \subseteq L$, with $\ell := |\mathcal{K}| = n/2 - t_1$, such that $s_{|\mathcal{K}}$ is $\ell$-free w.r.t. $s_{|L\setminus\mathcal{K}}$ and $w_1$. Without loss of generality, we further assume that $\mathcal{K} = [\ell]$ (this does not change the argument but eases the notations).
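Let us define $\mathcal{K}'$ as the sumset $\mathcal{K}' = \mathcal{K} + \{0, n/2\} = \mathcal{K} \cup \{i + n/2 : i \in \mathcal{K}\}$. We show hereafter that $z_{|\mathcal{K}'}$ is $(2\ell)$-free w.r.t. $w_1$, $w_2$, $s_{|[n]\setminus\mathcal{K}'}$ and $z_{|[n]\setminus\mathcal{K}'}$. […] (4)
Then, $z_{|\mathcal{K}'}$ can be expressed as […] (5)
with $u_1, \ldots, u_{2\ell-1}$ fresh uniform random variables, mutually independent of $w_1$, $w_2$, $s_{|[n]\setminus\mathcal{K}'}$ and $z_{|[n]\setminus\mathcal{K}'}$. We thus get that $z_{|\mathcal{K}'}$ is $(2\ell)$-free w.r.t. $w_1$, $w_2$, $s_{|[n]\setminus\mathcal{K}'}$ and $z_{|[n]\setminus\mathcal{K}'}$.
We shall now explain how to update the set $\mathcal{K}'$ to take into account the probes from $W_3$, i.e. while ensuring that $z_{|\mathcal{K}'}$ is further free w.r.t. $w_3$. Each variable in $w_3$ is either a random $r_i$ or an output share $z_i$. For each $r_i$ in $w_3$, with $i \in \mathcal{K}$, we remove $i + n/2$ from $\mathcal{K}'$. This amounts to removing the coordinate $c_i - r_i$ from $z_{|\mathcal{K}'}$ (see Equation 4). One can check that, doing so and treating $r_i$ as a "constant" $c_i$, a change of variables still yields an expression like Equation 5, where the random $u_i$'s are independent of $w_1$, $w_2$, $s_{|[n]\setminus\mathcal{K}'}$, $z_{|[n]\setminus\mathcal{K}'}$ and the probed $r_i$'s from $w_3$. For each $z_i$ in $w_3$, with $i \in \mathcal{K}'$, we further remove i from $\mathcal{K}'$. This amounts to removing $z_i$ from $z_{|\mathcal{K}'}$. From Equation 5, it is clear that removing one of the coordinates does not change the form of the distribution. Moreover, the subtuples of $z_{|\mathcal{K}''}$, where $\mathcal{K}''$ denotes the updated set, are mutually independent of the removed coordinates $z_i$. We hence get that the updated tuple $z_{|\mathcal{K}''}$ is $|\mathcal{K}''|$-free w.r.t. $w_1$, $w_2$, $z_{|[n]\setminus\mathcal{K}''}$ and $w_3$. Moreover, the update process removes at most $t_3$ elements from $\mathcal{K}'$, which implies
$|\mathcal{K}''| \ge 2\ell - t_3 = n - 2t_1 - t_3 \ge n - (t_1 + t_2 + t_3),$
where the last inequality uses $t_1 \le t_2$. In case the last inequality is strict, one can remove additional coordinates from $\mathcal{K}''$ to get $|\mathcal{K}''| = n - (t_1 + t_2 + t_3)$. We conclude that Property (3) is satisfied for n.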
It remains to show that Property (3) for ZeroEncoding implies IOS for RefreshGadget. We consider a set of probed wires $W = W_1 \cup W_2 \cup W_3 \cup W_4$, where $W_1$, $W_2$ and $W_3$ are the three sets of wires from ZeroEncoding as considered above, and $W_4$ are wires from RefreshGadget (excluding ZeroEncoding). The corresponding wire distribution $w_4$ only contains input shares $x_i$ or output shares $y_i$ (since the $z_i$ coordinates are included in $w_3$ w.l.o.g.). Let us denote by $W'_3$ the wires corresponding to the $z_i$'s for indexes i such that $x_i$ or $y_i$ is in $w_4$ […]
We now explain how to perform a perfect simulation $S$ […], i.e. how to perfectly simulate […] for any admissible pair of sharings (x, y). We first note that for any $x_i$ or $y_i$ in $w_4$, the corresponding $z_i$ is included in $w_3$ (from set $W'_3$), hence by construction the index i is excluded from the set $\mathcal{K}$ and is thus included in the sets I and J. We can then trivially simulate all the probed input/output shares, i.e. the coordinates of $w_4$. We can further perfectly simulate
$z_{|[n]\setminus\mathcal{K}} = y_{|[n]\setminus\mathcal{K}} - x_{|[n]\setminus\mathcal{K}}$ (7)
from the input shares $x_{|I}$ and the output shares $y_{|J}$. Now, by definition of ZeroEncoding, all the wire values are defined as linear combinations of random elements sampled from $\mathbb{K}$ (Steps 2 and 7 of Algorithm 1). We can then write
$(w_1, w_2, w_3, z) = M \cdot r$ (8)
where r denotes the vector of randomly sampled elements from $\mathbb{K}$ and M is a matrix with coefficients in $\mathbb{K}$. We can perfectly simulate $w_1$, $w_2$ and $w_3$ by picking a random r for which Equation 8 matches the simulated $z_{|[n]\setminus\mathcal{K}}$ from Equation 7. Now, according to Property (3), any subtuple of $z_{|\mathcal{K}}$ is mutually independent of the above simulation, while the last degree of freedom is defined such that z is a sharing of 0. This ensures that the above simulation is consistent with any value of $z_{|\mathcal{K}} := y_{|\mathcal{K}} - x_{|\mathcal{K}}$ for an admissible pair of sharings (x, y).

Revisiting the GJR Masking Scheme
In this section, we revisit the quasilinear-complexity Goudarzi-Joux-Rivain (GJR) masking scheme [29]. We first describe a variant of this scheme making use of the IOS refresh gadget described above, which is more general than the original scheme in the sense that it works on any base field K equipped with a Fast Fourier Transform (FFT) for multiple-point polynomial evaluation. We then show that the use of our refresh gadget allows us to patch a flaw in the security proof of the original scheme. We shall refer to the improved GJR scheme as the GJR+ scheme hereafter.
For such a vector, a sharing $x = (x_1, x_2, \ldots, x_n)$ of a plain value $x \in \mathbb{K}$ can be seen as the coefficients of a polynomial $P_x(\alpha) = \sum_{i=1}^{n} x_i \alpha^{i-1}$ satisfying $P_x(\omega) = x$. The quasilinear complexity can then be achieved by using efficient FFT-based multiplication for the multiplication gadget. Note that such an encoding is close to but different from Shamir's secret sharing [42]. In the latter, the shares are defined as evaluations of a polynomial at fixed points, and the plain value is the degree-0 coefficient.
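As a small illustration of this encoding, the Python sketch below builds a random $v_\omega$-sharing over a prime field by fixing the constant coefficient of an otherwise random polynomial, decodes it as $P_x(\omega)$ with Horner's rule, and shows that addition is share-wise. The choice of a prime field $\mathbb{F}_p$, the parameters used in the check and the names `encode`, `decode` and `add_gadget` are our assumptions for illustration.

```python
import random


def encode(x: int, omega: int, n: int, p: int) -> list[int]:
    """Random v_omega-sharing of x over F_p, v_omega = (1, omega, ..., omega^(n-1)):
    the shares are the coefficients of a random P_x of degree < n with P_x(omega) = x."""
    shares = [0] + [random.randrange(p) for _ in range(n - 1)]
    shares[0] = (x - decode(shares, omega, p)) % p  # fix the constant coefficient
    return shares


def decode(shares: list[int], omega: int, p: int) -> int:
    """<v_omega, x> = P_x(omega), evaluated with Horner's rule."""
    acc = 0
    for s in reversed(shares):
        acc = (acc * omega + s) % p
    return acc


def add_gadget(xs: list[int], ys: list[int], p: int) -> list[int]:
    """Share-wise addition: P_x + P_y evaluates to x + y at omega."""
    return [(a + b) % p for a, b in zip(xs, ys)]
```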
We assume the existence of a Fast Fourier Transform (FFT) algorithm that, given any polynomial $P \in \mathbb{K}[\alpha]$ of degree $< 2n$, maps the coefficients of P to the evaluations of P at 2n points of $\mathbb{K}$, with a complexity of $O(n\log n)$ operations. That is, for every $j \in [2n]$,
$\mathrm{FFT}_\alpha(P)_j = P(\alpha_j)$
for some $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{2n}) \in \mathbb{K}^{2n}$. We further assume that this FFT algorithm can be written as an arithmetic circuit on $\mathbb{K}$ solely composed of additions, subtractions and multiplications by constants in $\mathbb{K}$, and that it features an inverse FFT algorithm with the same properties (in terms of type and number of operations).
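One classical instance of this assumption, in the multiplicative setting of [29], is a radix-2 NTT over a prime field $\mathbb{F}_p$ with $2n \mid p - 1$, where $\alpha_j = \rho^{j-1}$ for a primitive 2n-th root of unity $\rho$. The Python sketch below (our illustration; $p = 257$, the generator 3 and the names `ntt` and `naive_eval` are assumptions) uses only additions, subtractions and multiplications by the constants $\rho^k$, and checks the result against direct evaluation.

```python
def ntt(a: list[int], root: int, p: int) -> list[int]:
    """Recursive radix-2 NTT over F_p: returns [A(root^j) for j in range(N)],
    where A has coefficient list a, N = len(a) is a power of 2 and root has
    multiplicative order N. Only additions, subtractions and multiplications
    by the constants root^k are used."""
    N = len(a)
    if N == 1:
        return a[:]
    even = ntt(a[0::2], root * root % p, p)  # A_even evaluated at root^(2j)
    odd = ntt(a[1::2], root * root % p, p)   # A_odd evaluated at root^(2j)
    out = [0] * N
    w = 1
    for k in range(N // 2):
        t = w * odd[k] % p                   # twiddle factor root^k
        out[k] = (even[k] + t) % p
        out[k + N // 2] = (even[k] - t) % p  # since root^(N/2) = -1
        w = w * root % p
    return out


def naive_eval(a: list[int], points: list[int], p: int) -> list[int]:
    """Direct evaluation of A at each point, O(N^2) operations."""
    return [sum(c * pow(x, i, p) for i, c in enumerate(a)) % p for x in points]


p, n = 257, 4
root = pow(3, (p - 1) // (2 * n), p)  # 3 generates F_257^*, so root has order 2n
coeffs = [5, 7, 11, 13] + [0] * n     # degree < n, zero-padded to length 2n
alpha = [pow(root, j, p) for j in range(2 * n)]
print(ntt(coeffs, root, p) == naive_eval(coeffs, alpha, p))  # True
```

The additive FFTs discussed later (Cantor, Gao-Mateer) follow a different recursion but satisfy the same interface.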
The GJR+ scheme is a standard circuit compiler for […], where $v^{(n)}_\omega \in \mathbb{K}^n$ is the vector defined in (9). As in the original scheme, we assume in the following that the order n is a power of two. The scheme could easily be extended to non-powers of two at the cost of a small constant efficiency factor.
We now give the description of the associated $v^{(n)}_\omega$-gadgets. For the sake of clarity, we shall omit the superscript and simply write $v_\omega$ in what follows.
Refresh Gadget. We use the refresh gadget of Section 4 (see Algorithm 2) for $v_\omega$-sharings, i.e. with the encoding vector v set to $v_\omega$. This refresh gadget is applied to the output of each operation gadget (in accordance with the definition of the standard circuit compiler). We recall that this gadget achieves the uniformity and IOS properties defined in Subsection 3.1.
Remark. This multiplication gadget is similar to the GJR multiplication gadget but we introduce a refreshing in Step 4. This refreshing is done using Algorithm 2 (see Section 4) where the encoding vector v ω and the input sharing are of size 2n.
Correctness. Let x and y be the values encoded by x and y respectively and let P x ∈ K[α] and P y ∈ K[α] be the degree-(n − 1) polynomials whose coefficients are the coordinates of x and y, so that we have P x (ω) = x and P y (ω) = y.
Let us first assume that Step 4 applies an identity mapping, i.e. $u' = u$. Then Steps 1-5 perform a classical FFT-based polynomial multiplication. Namely, the coordinates of t are the coefficients of the polynomial $P_t \in \mathbb{K}[\alpha]$ such that $P_t(\alpha) = P_x(\alpha) \cdot P_y(\alpha)$, and in particular $P_t(\omega) = x \cdot y$. Then Step 6 outputs a vector z such that $\langle v_\omega, z\rangle = P_t(\omega) = x \cdot y$, i.e. a $v_\omega$-sharing of $x \cdot y$.
Let $v_\omega = (1, \omega, \omega^2, \ldots, \omega^{2n-1})$; then we have […]
By correctness of the FFT-based polynomial multiplication, we hence have that $u = \mathrm{FFT}_\alpha(t)$ is a $v_\omega$-sharing of $x \cdot y$. Let us now consider the actual multiplication gadget with refreshing at Step 4. By correctness of the refresh algorithm, $u'$ is also a $v_\omega$-sharing of $x \cdot y$, and by the above relation we have that $\langle v_\omega, u'\rangle = x \cdot y$ implies $\langle v_\omega, \mathrm{FFT}^{-1}_\alpha(u')\rangle = x \cdot y$, which is $\langle v_\omega, t\rangle = x \cdot y$. We hence get the correctness of the multiplication gadget.
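To make the correctness argument concrete, the Python sketch below implements the identity-refresh case ($u' = u$) considered first above. The assumptions are ours, not the authors': $\mathbb{K} = \mathbb{F}_{257}$, FFT points given by the powers of a primitive 2n-th root of unity derived from the generator 3, and the hypothetical names `ntt`, `intt`, `encode`, `decode` and `mult_gadget`. Only "Step 4" and "Step 6" in the comments follow the step numbering of the text; the remaining lines are our own breakdown of Steps 1-5.

```python
import random

P, G = 257, 3  # assumed parameters: 2^8 divides P - 1 and G generates F_P^*


def ntt(a: list[int], root: int) -> list[int]:
    """Radix-2 NTT over F_P: evaluations of A at root^0, ..., root^(N-1)."""
    N = len(a)
    if N == 1:
        return a[:]
    even, odd = ntt(a[0::2], root * root % P), ntt(a[1::2], root * root % P)
    out, w = [0] * N, 1
    for k in range(N // 2):
        t = w * odd[k] % P
        out[k], out[k + N // 2] = (even[k] + t) % P, (even[k] - t) % P
        w = w * root % P
    return out


def intt(a: list[int], root: int) -> list[int]:
    """Inverse NTT: evaluate at inverse powers of root, then divide by N."""
    n_inv = pow(len(a), -1, P)
    return [c * n_inv % P for c in ntt(a, pow(root, -1, P))]


def decode(shares: list[int], omega: int) -> int:
    """<v_omega, x> = sum_i x_i * omega^(i-1) = P_x(omega)."""
    return sum(s * pow(omega, i, P) for i, s in enumerate(shares)) % P


def encode(x: int, omega: int, n: int) -> list[int]:
    """Random v_omega-sharing of x (random coefficients, constant term fixed)."""
    shares = [0] + [random.randrange(P) for _ in range(n - 1)]
    shares[0] = (x - decode(shares, omega)) % P
    return shares


def mult_gadget(xs: list[int], ys: list[int], omega: int) -> list[int]:
    """FFT-based product of two v_omega-sharings, with the refresh of Step 4
    replaced by the identity (u' = u)."""
    n = len(xs)
    root = pow(G, (P - 1) // (2 * n), P)       # FFT points alpha_j = root^(j-1)
    r = ntt(xs + [0] * n, root)                 # FFT of the zero-padded x
    s = ntt(ys + [0] * n, root)                 # FFT of the zero-padded y
    u = [ri * si % P for ri, si in zip(r, s)]   # 2n pointwise products
    u_prime = u                                 # Step 4 (refresh) omitted here
    t = intt(u_prime, root)                     # coefficients of P_t = P_x * P_y
    wn = pow(omega, n, P)
    # Step 6: fold P_t modulo (alpha^n - omega^n), which preserves P_t(omega)
    return [(t[i] + wn * t[n + i]) % P for i in range(n)]
```

Since $\deg P_t \le 2n - 2 < 2n$, the cyclic product computed by the NTT coincides with the ordinary polynomial product, and the folding step keeps the value at $\omega$ because $\omega$ is a root of $\alpha^n - \omega^n$.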

Scalar Multiplication Gadget.
For the particular case of a multiplication by a constant, a dedicated scalar multiplication gadget can be used, which is much more efficient than a regular multiplication gadget. Given a $v_\omega$-sharing $x = (x_1, \ldots, x_n)$ and a constant $\alpha \in \mathbb{K}$, the scalar multiplication gadget outputs $y = (\alpha \cdot x_1, \ldots, \alpha \cdot x_n)$, which satisfies $\langle v_\omega, y\rangle = \alpha \cdot \langle v_\omega, x\rangle$. This is done via n multiplication gates processing each share separately. Hence this scalar multiplication gadget achieves (n − 1)-probing security.
Square Gadget. For the particular case of a field $\mathbb{K}$ of characteristic 2, a square can be computed through a dedicated gadget much more efficiently than with a regular multiplication gadget. Given a $v_\omega$-sharing $x = (x_1, \ldots, x_n)$ of x, the square gadget outputs
$y_i = x_i^2 \cdot \omega^{i-1}$
for every $i \in [n]$. We then have $\langle v_\omega, y\rangle = \langle v_\omega, x\rangle^2$ by linearity of the squaring on a field of characteristic 2, which implies that y is indeed a $v_\omega$-sharing of $x^2$. The square gadget involves 2n multiplication gates processing each share separately. Hence this square gadget achieves (n − 1)-probing security. More generally, a share-wise gadget can compute any $q^k$-th power on a field of characteristic q (i.e. compute the k-th iterate of the Frobenius map).
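The identity $\sum_i x_i^2\,\omega^{i-1}\cdot\omega^{i-1} = \big(\sum_i x_i\,\omega^{i-1}\big)^2$ in characteristic 2 can be checked numerically. The Python sketch below works over $\mathbb{F}_{256}$ with the AES reduction polynomial, which is our choice of representation, and uses the hypothetical helpers `gf_mul`, `encode`, `decode` and `square_gadget`.

```python
import random

AES_POLY = 0x11B  # assumed representation of F_256


def gf_mul(a: int, b: int) -> int:
    """Multiplication in F_256 modulo AES_POLY."""
    res = 0
    while b:
        if b & 1:
            res ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return res


def decode(shares: list[int], omega: int) -> int:
    """<v_omega, x> = sum_i x_i * omega^(i-1) in F_256 (addition is XOR)."""
    acc, w = 0, 1
    for s in shares:
        acc ^= gf_mul(s, w)
        w = gf_mul(w, omega)
    return acc


def encode(x: int, omega: int, n: int) -> list[int]:
    """Random v_omega-sharing of x over F_256."""
    shares = [0] + [random.randrange(256) for _ in range(n - 1)]
    shares[0] = x ^ decode(shares, omega)
    return shares


def square_gadget(shares: list[int], omega: int) -> list[int]:
    """y_i = x_i^2 * omega^(i-1): two multiplications per share, no randomness."""
    out, w = [], 1
    for s in shares:
        out.append(gf_mul(gf_mul(s, s), w))
        w = gf_mul(w, omega)  # powers of omega are public constants
    return out
```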
Note that, extending the standard circuit compiler to include such gadget is straightforward but it would make the formalism heavier so we skip this extension from our presentation.

Field Extension and FFT Algorithm
In order to instantiate the GJR + scheme, it is necessary to consider an implementation of secure multiplication at order n over a finite field K and an element ω such that there exists an FFT algorithm which allows quasilinear multiplication of polynomials of degree at most n and coefficients in K and which can be written as an arithmetic circuit on K solely composed of additions, subtractions and multiplication by constants.
A possible approach (which was used in [29]) is to consider finite fields $\mathbb{K} = \mathbb{F}_q$ that contain a (2n)-th root of unity ω (i.e. such that $2n \mid q - 1$). However, most of the time, we cannot choose the underlying algebraic structure: we have to consider a specific cryptographic primitive with a given structure and implement it securely. In order to extend the original scheme to any finite field $\mathbb{F}_{p^m}$ for some prime number p (with m ≥ 1), we can use the general additive FFT proposed by Cantor [45,17]. In this case, we can instantiate it at order n over $\mathbb{F}_{p^\ell}$, where ℓ is the smallest even multiple of m such that $p^\ell \ge 2n$.
In particular, for most symmetric cryptographic schemes, the underlying structure is a finite field of characteristic 2, and over such a binary field the approach from [29] does not apply at all. For this case of utmost practical importance, we can use the Gao-Mateer additive FFT [27] for the secure implementation of multiplications at order n over binary fields $\mathbb{F}_{2^m}$ for m ≥ 2. The Gao-Mateer additive FFT is a variant of the Cantor additive FFT that works over finite fields of characteristic 2. Using this transform, if m is even with $2^m \ge 2n$, then we can directly use our technique over $\mathbb{K} = \mathbb{F}_{2^m}$; otherwise, we can simply instantiate it over $\mathbb{K} = \mathbb{F}_{2^\ell}$, where ℓ is the smallest even integer for which $2^\ell \ge 2n$ and $m \mid \ell$.

Security Reduction
This section provides a security reduction for the GJR + scheme. We show that under the probing security of the FFT, the scheme achieves region probing security. More formally, the reduction is based on the following hypothesis on the FFT algorithm. We can then state our reduction theorem. A discussion of the practical meaning of Hypothesis 1 is given after the theorem proof.

Theorem 3. Under the FFT Probing Security hypothesis and the $t^R_n$-IOS property of the refresh gadget, the GJR+ compiler is $r_n$-region probing secure with […], where $|\mathrm{FFT}_n|$ denotes the (maximum) number of wires in the FFT circuits for 2n input sharings.
Note that the refresh gadget described in Section 4 satisfies $t^R_n = n - 1$ and $|G^R_n| = 3n\log n$. Assuming the FFT algorithm is quasilinear and that it can tolerate a linear number of probes (in the encoding order n), and denoting
$|\mathrm{FFT}_n| = \alpha \cdot n\log n, \quad |G^R_n| = \beta \cdot n\log n, \quad t^{\mathrm{FFT}}_n = \gamma \cdot n$
for some constants α, β and γ (with γ < 1), one can check that the minimum in Equation 19 is reached for […]. In particular, we obtain a probing rate $r_n = \Theta(1/\log n)$.
The proof of Theorem 3 is based on the two following lemmas.

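Lemma 1. Under the FFT Probing Security hypothesis, the circuit processing $(x, y) \mapsto u$, where $r = \mathrm{FFT}_\alpha(x\|0)$, $s = \mathrm{FFT}_\alpha(y\|0)$ and $u_i = r_i \cdot s_i$ for every $i \in [2n]$, is $t^{\mathrm{FFT}}_n$-probing secure w.r.t. the $v_\omega$-encoding.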
Proof. Let us denote by C the considered circuit and by W the set of probed wires from C, with $|W| \le t^{\mathrm{FFT}}_n$. We show how to construct the simulator $S_{C,W}$ that outputs a perfect distribution of $C_W(x, y)$, where x and y are uniform $v_\omega$-linear sharings. The simulator $S_{C,W}$ first calls the simulators $S_{\mathrm{FFT},W_1}$ and $S_{\mathrm{FFT},W_2}$ after constructing $W_1$ and $W_2$ as follows: for every $w \in W$, w is added to $W_1$ if it corresponds to a wire in the first FFT (i.e. applying to x) and w is added to $W_2$ if it corresponds to a wire in the second FFT (i.e. applying to y). Whenever w corresponds to a product $u_i = r_i \cdot s_i$, the wire corresponding to $r_i$ is added to $W_1$ and the wire corresponding to $s_i$ is added to $W_2$. By construction, we have $|W_1|, |W_2| \le |W| \le t^{\mathrm{FFT}}_n$, which ensures that $S_{\mathrm{FFT},W_1}$ and $S_{\mathrm{FFT},W_2}$ output perfect simulations of all the wires in W pertaining to the two FFT circuits (by the FFT Probing Security hypothesis). Moreover, by construction, they also output the pairs $(r_i, s_i)$ for all the wires in W corresponding to a product $u_i = r_i \cdot s_i$, which can then be perfectly simulated as well.

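Lemma 2. Under the FFT Probing Security hypothesis, the circuit processing $u' \mapsto z$, where $t = \mathrm{FFT}^{-1}_\alpha(u')$ and $z_i = t_i + \omega^n \cdot t_{n+i}$ for every $i \in [n]$, is $(t^{\mathrm{FFT}}_n/2)$-probing secure.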
Proof. The proof follows the same lines as the proof of Lemma 1. Let C denote the considered circuit and W the set of probed wires from C such that $|W| \le t^{\mathrm{FFT}}_n/2$. The simulator $S_{C,W}$ essentially relies on the simulator $S_{\mathrm{FFT},W'}$, where $W'$ is constructed as follows: for every $w \in W$, w is added to $W'$ if it corresponds to a wire in the FFT. Otherwise, w corresponds to a wire in the computation $z_i = t_i + \omega^n \cdot t_{n+i}$ for some i, in which case we add the wires corresponding to $t_i$ and $t_{n+i}$ to $W'$. By construction, we have $|W'| \le 2 \cdot |W| \le t^{\mathrm{FFT}}_n$, which ensures that $S_{\mathrm{FFT},W'}$ outputs a perfect simulation of all the wires in $W'$, i.e. of all the wires in W pertaining to the FFT plus the pairs $(t_i, t_{n+i})$ for every i such that a wire in the computation $z_i = t_i + \omega^n \cdot t_{n+i}$ appears in W. The latter wires can then also be perfectly simulated from the pairs $(t_i, t_{n+i})$, which concludes the proof.
Proof (Theorem 3). The proof simply follows from Lemma 1 and Lemma 2 by applying the composition theorem (Theorem 1). We further note that the term depending on the addition gadget can be removed from the expression of the probing rate, since the latter satisfies $t^{\oplus}_n = n - 1$ and $|G^{\oplus}_n| = 2n$, which clearly makes this term greater than the term depending on the FFT.
Theorem 3 formally shows that if probing security can be demonstrated for the FFT algorithm, then we obtain region probing security for the GJR+ scheme. Unfortunately, it is not clear whether the classical FFT algorithms are probing secure or not. To some extent, this open issue is related to the choice of ω: some choices lead to probing insecurity², while it is not clear whether some choices can provide probing security. Nevertheless, following the approach of [29], it is possible to obtain random probing security by picking ω randomly in a large enough field K. This is formally stated in Subsection 5.4.
Discussion on Hypothesis 1. We now provide some insights about whether Hypothesis 1 is verified in practice. Given input values $\mathbb{K}$, α and ω, there exists an effectively computable function that checks whether Hypothesis 1 is verified. Indeed, given a $v_\omega$-encoding x of the value x, there exists a circuit C of size Θ(n log n) that takes x as input and computes $\mathrm{FFT}_\alpha(x\|0)$. Probing a node i of the circuit reveals $\langle u_i, x\rangle$, where the vector $u_i \in \mathbb{K}^n$ depends only on α and i. The adversary can recover $x = \langle v_\omega, x\rangle$ if and only if he can probe a subset S such that $v_\omega$ is in the span of $(u_i)_{i\in S}$. Indeed, according to Lemma 1 of [29] (see Appendix C): […] (12)
Therefore, Hypothesis 1 can be verified by checking, for all $\binom{\Theta(n\log n)}{|S|}$ choices of |S| probes in the NTT circuit, whether the left part of Equation 12 holds, a subtask that can be done via Gaussian elimination. When |S| = Θ(n), the number of subsets to check is superexponential in n but is still tractable via exhaustive search for small values of n.
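The exhaustive search can be sketched as follows. This Python code is our own illustration over $\mathbb{F}_{257}$: it symbolically runs a textbook radix-2 NTT on $(x_1, \ldots, x_n, 0, \ldots, 0)$ to obtain the vectors $u_i$ (counting input wires as probe-able), then looks for the smallest set S with $v_\omega \in \mathrm{span}(u_i)_{i\in S}$ by Gaussian elimination, so that $t^{\mathrm{FFT}}_n$ is that size minus one. The circuit layout is our assumption and need not match the circuit of Figure 4, so its results may differ from the values reported below; the names `rank_mod_p`, `in_span`, `ntt_wire_vectors` and `min_probes` are hypothetical.

```python
from itertools import combinations

P = 257  # illustrative prime field F_257


def rank_mod_p(rows: list[list[int]]) -> int:
    """Rank of a list of vectors over F_P, by Gaussian elimination."""
    rows = [r[:] for r in rows]
    rank = 0
    ncols = len(rows[0]) if rows else 0
    for c in range(ncols):
        piv = next((i for i in range(rank, len(rows)) if rows[i][c]), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        inv = pow(rows[rank][c], -1, P)
        rows[rank] = [e * inv % P for e in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][c]:
                f = rows[i][c]
                rows[i] = [(e - f * g) % P for e, g in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def in_span(v: list[int], vectors: list[list[int]]) -> bool:
    """True iff v is an F_P-linear combination of the given vectors."""
    return rank_mod_p(vectors + [v]) == rank_mod_p(vectors)


def ntt_wire_vectors(n: int, root: int) -> list[list[int]]:
    """Symbolic radix-2 NTT of (x_1, ..., x_n, 0, ..., 0): each wire carries
    <u, x> for some u in F_P^n. Returns the distinct non-zero vectors u."""
    wires = []

    def rec(a, r):
        N = len(a)
        if N == 1:
            return a
        even, odd = rec(a[0::2], r * r % P), rec(a[1::2], r * r % P)
        out, w = [None] * N, 1
        for k in range(N // 2):
            t = [w * e % P for e in odd[k]]
            out[k] = [(e + f) % P for e, f in zip(even[k], t)]
            out[k + N // 2] = [(e - f) % P for e, f in zip(even[k], t)]
            wires.extend([t, out[k], out[k + N // 2]])
            w = w * r % P
        return out

    inputs = [[int(i == j) for j in range(n)] for i in range(n)] + [[0] * n for _ in range(n)]
    wires.extend(inputs)
    rec(inputs, root)
    return [list(u) for u in sorted({tuple(u) for u in wires if any(u)})]


def min_probes(n: int, omega: int, root: int) -> int:
    """Smallest |S| such that v_omega lies in the span of the probed u_i's."""
    v = [pow(omega, i, P) for i in range(n)]
    U = ntt_wire_vectors(n, root)
    for size in range(1, n + 1):
        if any(in_span(v, list(S)) for S in combinations(U, size)):
            return size
    raise ValueError("unreachable: the n input wires span F_P^n")


root8 = pow(3, 32, P)  # element of order 8 in F_257^* (3 generates F_257^*)
print("t_FFT for n=4, omega=138:", min_probes(4, 138, root8) - 1)
```

A degenerate choice such as $\omega = \alpha_2$ (a FFT evaluation point) is broken by a single probe on the corresponding output wire, which the search detects immediately.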
As an illustration, the circuit C in Figure 4 computes the degree-8 NTT over $\mathbb{F}_{257}$ for input $(x_0, x_1, x_2, x_3, 0, 0, 0, 0)$. Each node i is labelled by a vector $u_i$ such that probing i reveals $\langle u_i, x\rangle$, with $x = (x_0, x_1, x_2, x_3)$. Note that since half the input coefficients are 0, the circuit description is somewhat simpler than a full NTT. By exhaustive search, we can see that C is 3-probing secure for ω = 138, whereas it is only 2-probing secure for ω = 209. For prime fields and a power-of-two NTT, we were able in practice to determine $t^{\mathrm{FFT}}_n$ only for n ≤ 8, due to combinatorial explosion. This raises the question of proposing algorithms more efficient than exhaustive search to compute $t^{\mathrm{FFT}}_n$. Note that $t^{\mathrm{FFT}}_n + 1$ is the minimum weight w for which we can find a vector a with at most w non-zero coefficients such that $a \cdot U = v_\omega$, where $U = (u_i)_i \in \mathbb{K}^{\Theta(n\log n)\times n}$. This can be cast as a specific instance of the information-set decoding (ISD) problem, which is common in code-based cryptography. However, unlike ISD instances that are usually studied in code-based cryptanalysis, the matrix U we consider has more rows than columns, is not random but fixed, and the underlying field may be non-binary. We see the question of computing $t^{\mathrm{FFT}}_n$ more efficiently as an interesting open problem.

Security Proof for Large Fields
The security proof given hereafter follows the same lines as the original proof from [29] but it is more general as it applies to any instance of the GJR + scheme (with any field and FFT algorithm) and it holds in the stronger region probing model. Moreover, our security proof corrects a flaw in the original proof which we exhibit hereafter.
Flaw in the original proof. The original GJR scheme is based on a different refresh gadget for the composition. In a nutshell, their gadget follows the classical approach of adding a random ω-encoding of 0. The latter is generated based on a fixed ω-encoding of 0 denoted e in [29], which is randomly generated at the beginning of the computation and which is considered to be fully leaked to the adversary. When a fresh ω-encoding of 0 must be generated, one draws a random vector u and multiplies it with e through the multiplication gadget, which gives a fresh and uniform ω-encoding of 0. Such a refresh procedure satisfies a slightly weaker version of the IOS property (see Subsection 3.1 for comparison). Specifically, when a sharing x is refreshed into a new sharing x , the leakage from the refresh procedure can be simulated by linear combinations of x and linear combinations of x . These leaking linear combinations can in turn be perfectly simulated with overwhelming probability over the random distribution of ω, provided that ω is defined on a large enough field K (see Lemmas 1 & 2 of [29] which we recall in Appendix C). The flaw in the proof is that it implicitly assumes that the aforementioned leaking linear combinations have constant coefficients with respect to ω. However, by definition of the refresh procedure, these coefficients depend on e, the initial ω-encoding of 0, which cannot be considered as constant with respect to ω. This prevents the application of Lemmas 1 & 2 of [29]. This bug invalidates the composition security proof of the original GJR scheme although it does not lead to an obvious security flaw: it is not clear whether the linear combinations coming from the refresh imply an exploitable information leakage (or equivalently a simulation failure).
New proof. The region-probing security of the GJR + scheme simply holds from the IOS property of the refresh gadget and assuming that the underlying FFT algorithm is somehow linear. This is captured by the following definition. The above definition implies that the value carried by each wire in the FFT circuit can be expressed as a linear combination of the coordinates of the input sharing. This property is necessary to apply the security argument of the original GJR scheme. Note that this requirement is relatively weak since it is satisfied by classical FFT algorithms such as the NTT (used in [29]) and the Gao-Mateer additive FFT [27].

Corollary 1.
If the FFT circuit is linear and made of $|\mathrm{FFT}_n| = \alpha\, n\log n$ wires, the GJR+ compiler is $(r_n, \varepsilon_n)$-region probing secure with […] and […].

Proof. Using Lemmas 1 & 2 of [29] (which are recalled in Appendix C for the sake of completeness) and thanks to the linearity of the FFT circuit, we have that for any choice of n − 1 leaking wires from the FFT circuit, the probability that the leaking wires cannot be perfectly simulated is lower than $n/|\mathbb{K}|$. Besides the linearity of the FFT circuit, the only requirement for this upper bound to apply is that the choice of the leaking wires is made independently of ω, which is the case in the region probing model since the placement of the probes by the adversary is done independently of the random generation of ω. We can then directly apply Theorem 3 and obtain the probing rate $r_n$ from Equation 11 with $\gamma = \frac{n-1}{n} \approx 1$ and β = 3. In Appendix B, we further detail the security proof of GJR+ in the random probing model, which follows from its security in the region probing model by applying a Chernoff bound. This further implies the security of GJR+ in the noisy leakage model by the reduction from [25].

Application
In this section, we present an application of our extended GJR scheme and compare it with a more standard scheme based on SNI gadgets. We investigate the masked computation of two different ciphers:
• The Advanced Encryption Standard (AES) [1]: a very common application scenario which favors efficient masking schemes on the field $\mathbb{F}_{256}$;
• MiMC [4]: a cipher with an efficient arithmetic representation on a large field. MiMC has been designed with the aim to minimize the number of multiplications (which makes it particularly amenable to masked computation). We focus on the prime-field variant of MiMC (the base field is a prime field $\mathbb{F}_p$).
For these two application contexts, we described masked computations based on the two following masking schemes: • The GJR + scheme described in Section 5 of this paper, in two modes of application: on a binary field with the Gao-Mateer FFT algorithm (to mask AES), -on a prime field with NTT algorithm (to mask MiMC).
• An extended ISW scheme, that we shall refer to as ISW + , and which is based on the ISW multiplication gadget [31] over the base field K (either F 256 for AES or F p for MiMC), share-wise linear gadgets (for additions, subtractions, F 256 -squares, multiplications by constants), the BCPZ quasilinear refresh gadget [10].
We first address implementation aspects of the two above masking schemes, then describe the masking of AES and MiMC with these schemes, and finally compare their performances in terms of operation counts and randomness consumption.
Cautionary note: To ease the presentation, we consider that each gadget includes a refreshing of its output sharing, except the ISW multiplication gadget, which achieves the SNI notion without further refreshing. In a masked computation, a sharing might be the input of several gadgets, which would be an issue with respect to region probing security (e.g., one could accumulate $t$ probes on this sharing per gadget). We therefore require that such a sharing be refreshed before each new usage.

Multiplication gadget on F p based on the NTT
It is well known that polynomials can be multiplied in quasilinear time over finite fields using the Number Theoretic Transform (NTT), a Fast Fourier Transform (FFT) which requires the coefficient ring to contain certain roots of unity. More precisely, two polynomials of degree $< N$ over a finite field $\mathbb{F}_q$ can be multiplied in $O(N\log N)$ arithmetic operations in $\mathbb{F}_q$ if $\mathbb{F}_q$ contains a primitive $2N$-th root of unity (which occurs if and only if $2N$ divides $q-1$). The number theoretic transform was introduced by Pollard [37], and we refer the reader to [43, Section 8.2] for a comprehensive description. As stated in [43, Theorem 8.18], if $N$ is a power of 2 ($N = 2^m$), the multiplication of two polynomials of degree $< N$ over a finite field $\mathbb{F}_q$ which contains a primitive $2N$-th root of unity requires $6N\log N + 6N$ additions in $\mathbb{F}_q$, $3N\log N + 4N - 2$ multiplications by constants in $\mathbb{F}_q$, $2N$ (bilinear) multiplications in $\mathbb{F}_q$, and $2N$ divisions by $2N$ in $\mathbb{F}_q$.
Our multiplication gadget (for an encoding order $n$) described in Subsection 5.1 over such a finite field thus has a total complexity of $8n\log n + 11n$ additions in $\mathbb{F}_q$, $5n\log n + 7n - 2$ multiplications by constants in $\mathbb{F}_q$, and $2n$ (bilinear) multiplications in $\mathbb{F}_q$. It requires $2n\log n + 2n$ random elements from $\mathbb{F}_q$.
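The transform-based product described above can be sketched as follows: two forward NTTs of size $2N$, $2N$ pointwise (bilinear) products, an inverse NTT, and a final division by $2N$. This is a minimal unmasked sketch, not the masked gadget of Subsection 5.1, and it does not reproduce the exact operation counts of [43, Theorem 8.18]. The toy field $\mathbb{F}_{257}$ (where $2N = 16$ divides $256$), the use of $3$ as a generator of $\mathbb{F}_{257}^*$, and the sample polynomials are assumptions.

```python
Q = 257    # toy prime field (assumption): Q - 1 = 256 is divisible by 2N
N = 8      # multiply polynomials of degree < N
M = 2 * N  # transform size


def ntt(a, root):
    """Recursive radix-2 NTT over F_Q: returns [sum_i a_i * root^(i*k) for k]."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], root * root % Q)
    odd = ntt(a[1::2], root * root % Q)
    out = [0] * n
    w = 1
    for j in range(n // 2):
        t = w * odd[j] % Q
        out[j] = (even[j] + t) % Q
        out[j + n // 2] = (even[j] - t) % Q
        w = w * root % Q
    return out


def ntt_polymul(f, g):
    """Product of two polynomials of degree < N (coefficient lists) over F_Q."""
    assert len(f) <= N and len(g) <= N
    root = pow(3, (Q - 1) // M, Q)       # 3 generates F_257^*, so root has order 2N
    assert pow(root, N, Q) == Q - 1      # sanity check: primitive 2N-th root of unity
    fa = ntt(f + [0] * (M - len(f)), root)
    ga = ntt(g + [0] * (M - len(g)), root)
    pointwise = [x * y % Q for x, y in zip(fa, ga)]   # 2N bilinear multiplications
    h = ntt(pointwise, pow(root, -1, Q))              # inverse transform (up to 1/2N)
    inv_m = pow(M, -1, Q)                             # division by 2N
    return [x * inv_m % Q for x in h][: len(f) + len(g) - 1]


def schoolbook(f, g):
    """Reference quadratic product over F_Q."""
    h = [0] * (len(f) + len(g) - 1)
    for i, x in enumerate(f):
        for j, y in enumerate(g):
            h[i + j] = (h[i + j] + x * y) % Q
    return h


f = [5, 0, 17, 250, 1, 9, 33, 2]
g = [7, 128, 3, 0, 44, 1, 256, 6]
assert ntt_polymul(f, g) == schoolbook(f, g)
```

A $2N$-point transform suffices because the product of two polynomials of degree $< N$ has degree at most $2N - 2$, so no wrap-around occurs.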

Multiplication gadget on F 2 k based on the Gao-Mateer FFT
The classical NTT cannot be applied when the underlying field does not contain the desired roots of unity. We describe the additive FFT algorithm proposed by Gao and Mateer in 2010 [27], which works over fields of characteristic two. The idea of this class of FFTs is to evaluate polynomials over an $m$-dimensional $\mathbb{F}_2$-linear (additive) subspace of $K$ rather than over a multiplicative group, and it comes in two flavors: generic algorithms for an arbitrary $m$, or specialized ones for $m$ a power of two. The specialized algorithms are faster than the generic ones, but the condition on $m$ heavily constrains their use. We use the generic algorithm with a slight generalization of Cantor bases, which we call self-folding bases; these allow even more aggressive optimization than what is done in [16,15,21] and may be of independent interest.
Let $\mathbb{F} = \mathbb{F}_{2^k}$ be a finite field of characteristic two. We consider the additive FFT of a polynomial $f \in \mathbb{F}[x]$, that is, we evaluate $f$ over the $\mathbb{F}_2$-linear span generated by $m$ elements $\beta_0, \dots, \beta_{m-1} \in \mathbb{F}$ linearly independent over $\mathbb{F}_2$. This span contains $N = 2^m$ elements, and such elements exist if and only if $N \le 2^k$, or equivalently $m \le k$.
Taylor Expansion. An important subroutine of the Gao-Mateer additive FFT is the Taylor expansion, a slight variant of the usual notion of Taylor series. It consists of writing any polynomial $f \in \mathbb{F}[x]$ of degree $< N$ as follows:
$$f(x) = \sum_{i=0}^{N/2-1} h_i(x)\,(x^2 + x)^i,$$
where each $h_i$ is a polynomial of $\mathbb{F}[x]$ of degree at most 1. Algorithm 6 (provided in Appendix D) is an algorithm presented in [27] for computing the Taylor expansion of a polynomial in the case where $m$ is generic (and not a power of 2); it is shown to require $\frac{1}{2}N\log N - \frac{1}{2}N$ field additions and no multiplication.
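The expansion can be checked with a naive quadratic-time sketch based on repeated long division by $x^2 + x$ (this is not the recursive Algorithm 6 of [27]). It illustrates why no multiplication is needed: $x^2 + x$ is monic with coefficients in $\mathbb{F}_2$, so both the division and the reconstruction only use field additions, i.e. XORs when field elements are encoded as bit strings. The integer encoding and the sample coefficients are assumptions.

```python
def taylor_expand(f):
    """Naive Taylor expansion of f at x^2 + x over a field of characteristic two.

    Field elements are ints (bit vectors) and field addition is XOR.
    Returns [(c0, c1), ...] with f(x) = sum_i (c0 + c1*x) * (x^2 + x)^i.
    """
    f = list(f)
    parts = []
    while f:
        # Long division by the monic divisor x^2 + x needs only additions.
        q = [0] * max(len(f) - 2, 0)
        for d in range(len(f) - 1, 1, -1):
            c = f[d]
            if c:
                q[d - 2] = c    # quotient gets c * x^(d-2)
                f[d] ^= c       # subtract c * x^d ...
                f[d - 1] ^= c   # ... and c * x^(d-1)
        parts.append((f[0], f[1] if len(f) > 1 else 0))  # remainder is h_i
        f = q
    return parts


def poly_add(a, b):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return [x ^ y for x, y in zip(a, b)]


def reconstruct(parts):
    """Compute sum_i h_i(x) * (x^2 + x)^i; (x^2 + x)^i has 0/1 coefficients."""
    result, power = [0], [1]
    for c0, c1 in parts:
        term = [0] * (len(power) + 1)
        for k, bit in enumerate(power):
            if bit:
                term[k] ^= c0
                term[k + 1] ^= c1
        result = poly_add(result, term)
        nxt = [0] * (len(power) + 2)          # power <- power * (x^2 + x)
        for k, bit in enumerate(power):
            nxt[k + 1] ^= bit
            nxt[k + 2] ^= bit
        power = nxt
    return result


def trim(p):
    p = list(p)
    while p and p[-1] == 0:
        p.pop()
    return p


f = [0x57, 0x83, 0x1A, 0x00, 0xFF, 0x02, 0x9C, 0x41]  # hypothetical F_256 coefficients
parts = taylor_expand(f)
assert trim(reconstruct(parts)) == trim(f)
```

For instance, over any field of characteristic two, $x^2 = x + 1\cdot(x^2 + x)$, so `taylor_expand([0, 0, 1])` returns `[(0, 1), (1, 0)]`.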
Basis folding. We abusively call basis any tuple $B = (\beta_0, \dots, \beta_{m-1}) \in \mathbb{F}^m$ of $m$ elements of $\mathbb{F}$ linearly independent over $\mathbb{F}_2$. For a basis $B$ of length $m$ and an integer $0 \le i < 2^m$ which can be written as $i = \sum_{j=0}^{m-1} a_j 2^j$ with $a_j \in \{0, 1\}$, we note:
$$B[i] = \sum_{j=0}^{m-1} a_j \beta_j .$$
We say that $B[i]$ is the $i$-th element of the span $\langle B \rangle$. An important subroutine in [27] consists of what we call folding a basis, a process we recall in Algorithm 7.
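The indexing $B[i]$ can be sketched in a few lines, assuming elements of $\mathbb{F}_{2^k}$ are encoded as $k$-bit integers in a polynomial basis, so that addition is XOR and computing $B[i]$ never needs a field multiplication. The sample basis below is hypothetical; linear independence over $\mathbb{F}_2$ is checked by verifying that the $2^m$ elements $B[0], \dots, B[2^m - 1]$ are pairwise distinct.

```python
def basis_element(basis, i):
    """B[i] = sum_j a_j * beta_j where i = sum_j a_j * 2^j (addition is XOR)."""
    acc = 0
    for j, beta in enumerate(basis):
        if (i >> j) & 1:
            acc ^= beta
    return acc


def span(basis):
    """All elements of <B>, listed in the order B[0], B[1], ..., B[2^m - 1]."""
    return [basis_element(basis, i) for i in range(1 << len(basis))]


B = [0x02, 0x04, 0x09]             # hypothetical basis of three elements of F_256
S = span(B)
assert len(set(S)) == 2 ** len(B)  # distinct elements <=> linear independence over F_2
print([hex(s) for s in S])
```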
Additive FFT. Gao and Mateer [27] proposed an algorithm for computing the additive FFT of a polynomial $f$ over the subspace $\langle B \rangle$ generated by a basis $B$. This algorithm is described in Algorithm 8 (Appendix D) and costs $2N\log N - 2N + 1$ field additions, as well as $\frac{1}{4}N(\log N)^2 + \frac{3}{4}N\log N - \frac{N}{2}$ scalar field multiplications. No formal description of the inverse algorithm is given in [27], but it is observed there that an inverse additive FFT can be obtained by performing the inverse of each operation in reverse order. Since the scalar field multiplications and their inverses can be precomputed, the inverse additive FFT has the same cost as the forward algorithm.
Self-Folding Bases. One could assume that the choice of B is not too important, because the operations involving B can be precomputed anyway. However, we show in this subsection that carefully choosing B can go a long way in making the additive FFT faster, simpler to implement and less costly in memory.
In fact, in the case where $m$ is a power of two, Bernstein et al. [16] showed that, by taking $\beta_{m-1} = 1$ in their Cantor basis, one can save some of the multiplications (namely those involving the inverse of $\beta_{m-1}$) at the top level of the recursion. We take this idea further and propose a specific kind of basis such that $\beta_{m-1} = 1$ at every layer of the recursion. In the following, such bases are called self-folding bases.

Proof. The first item is immediate from the definition. The second item is proven in the special case F = F and $m = k$ in the appendix of [27], and the proof is constructive. From there, the general case is immediate to obtain by embedding the solutions for F in the larger field F and keeping only the $m$ last elements of $B$.
From Proposition 1, we can see that B "folds onto itself": folding B into G, D yields subsets of B, and this self-folding property transfers to D.
The notion of self-folding basis is close to that of Cantor basis [17,27]; the only difference is that the elements are taken in reverse order, and that no condition is imposed on the basis elements belonging to a subfield $\mathbb{F}_{2^m}$ (such a subfield does not always exist, thus self-folding bases are slightly more general objects than Cantor bases). The most important difference, however, is how they are used. In [27,16,15], Cantor bases are used to speed up the specialized additive FFT algorithm of [27], which only works on a restricted set of parameters (namely, when $m$ is a power of two). In comparison, our improvements apply to speed up the generic algorithm of [27], which works for any value of $m$.
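The following sketch checks the "folds onto itself" property on a toy instance. It assumes (our reading of the comparison with Cantor bases, not a definition quoted from this paper) that a self-folding basis satisfies $\beta_{m-1} = 1$ and $\beta_i^2 + \beta_i = \beta_{i+1}$ for $i < m-1$, and it implements folding as $\gamma_i = \beta_i/\beta_{m-1}$ and $\delta_i = \gamma_i^2 + \gamma_i$ for $i < m-1$, which is our reading of the folding step of [27] (Algorithm 7 is not reproduced here). The field $\mathbb{F}_{256}$ with the AES reduction polynomial and the brute-force construction are illustrative choices.

```python
AES_POLY = 0x11B  # assumption: F_256 = F_2[x]/(x^8 + x^4 + x^3 + x + 1)


def gf_mul(a, b):
    """Multiply two elements of F_256 encoded as bytes (shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
    return r


def gf_inv(a):
    """Inverse in F_256^*, computed as a^254 by square-and-multiply."""
    r, e, n = 1, a, 254
    while n:
        if n & 1:
            r = gf_mul(r, e)
        e = gf_mul(e, e)
        n >>= 1
    return r


def fold(basis):
    """Fold (beta_0, ..., beta_{m-1}) into G and D (assumed form of the step in [27]):
    gamma_i = beta_i / beta_{m-1} and delta_i = gamma_i^2 + gamma_i, for i < m - 1."""
    inv_last = gf_inv(basis[-1])
    gamma = [gf_mul(b, inv_last) for b in basis[:-1]]
    delta = [gf_mul(g, g) ^ g for g in gamma]
    return gamma, delta


def self_folding_basis(m):
    """Hypothetical construction: beta_{m-1} = 1 and beta_i^2 + beta_i = beta_{i+1}."""
    basis = [1]
    while len(basis) < m:
        target = basis[0]
        root = next((x for x in range(256) if gf_mul(x, x) ^ x == target), None)
        if root is None:
            raise ValueError("x^2 + x = target has no solution in F_256")
        basis.insert(0, root)
    return basis


B = self_folding_basis(8)
G, D = fold(B)
span = {0}
for beta in B:
    span |= {s ^ beta for s in span}   # closes the span under adding beta
print([hex(b) for b in B])
```

Under these assumptions, $G$ is $B$ without its last element and $D$ is $B$ without its first element, so $D$ again ends with $1$ and the property transfers to $D$, as stated in the text.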
Half-FFT. Our improved (half-)FFT works with a self-folding basis and its iterated foldings. It is described in Appendix D. Its main advantage is that step 3 of Algorithm 8 becomes unnecessary; in total, this saves $N\log N$ scalar multiplications. Moreover, it halves the number of precomputed tables. As another algorithmic optimization, we make full use of the fact that, for polynomial multiplications, half of the inputs of each FFT call are zero coefficients, which speeds up the computations by a factor of two compared to a regular additive FFT. These optimizations apply in a similar way to the inverse additive FFT, except that we can no longer exploit the fact that half of the polynomial coefficients are zero.

Algorithm 3 Masked AES
A masked implementation of this process thus involves 18 linear gadgets as well as 15 refresh gadgets (for the variables used multiple times). The MaskedExp procedure is further depicted in Algorithm 4, which is based on the Rivain-Prouff scheme [41]. As explained above, a sharing in input of several gadgets is refreshed before each new usage.

Algorithm 4 MaskedExp
Require: $n$-sharing $\mathbf{x}$ of $x \in \mathbb{F}_{256}$
Ensure: $n$-sharing of $x^{254}$

The gadget count of the masked AES is given in Table 4. According to Algorithm 4, MaskedExp involves 4 multiplication gadgets, 3 linear gadgets and 4 refresh gadgets. According to the above description, we get a total of 72 linear gadgets and 60 refresh gadgets for the full MaskedMixColumns. One full round is composed of 16 calls to MaskedExp, one call to MaskedMixColumns, plus 32 linear gadgets (16 gadgets $G^{\oplus}_n$ and 16 gadgets $G^{\mathrm{Aff}}_n$). A full AES computation is composed of 9 full rounds, 1 partial round (without MixColumns), plus one key addition (16 gadgets $G^{\oplus}_n$).
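As a sanity check of the arithmetic, the tally below combines only the per-call gadget counts stated in the paragraph above; it ignores anything not mentioned there (for instance the key schedule), so it is an illustration rather than a replacement for Table 4. The constant names are hypothetical.

```python
from collections import Counter

# Per-call gadget counts stated in the text (hypothetical constant names).
MASKED_EXP = Counter(mult=4, linear=3, refresh=4)        # Algorithm 4
MASKED_MIX_COLUMNS = Counter(linear=72, refresh=60)     # 4 columns x (18 linear, 15 refresh)
ROUND_LINEAR = Counter(linear=32)                       # 16 G^xor + 16 G^Aff gadgets
KEY_ADDITION = Counter(linear=16)                       # 16 G^xor gadgets


def scaled(counter, k):
    """Multiply every gadget count of `counter` by k."""
    return Counter({name: k * value for name, value in counter.items()})


full_round = scaled(MASKED_EXP, 16) + MASKED_MIX_COLUMNS + ROUND_LINEAR
partial_round = scaled(MASKED_EXP, 16) + ROUND_LINEAR   # no MixColumns
masked_aes = scaled(full_round, 9) + partial_round + KEY_ADDITION
print(dict(masked_aes))
```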

Masking of MiMC
Let x be some plaintext and k be some secret key, both belonging to some large field K.
The MiMC cipher is defined as:
$$\mathrm{MiMC}(x, k) = (F_{r-1} \circ \cdots \circ F_1 \circ F_0)(x) + k, \qquad F_i(x) = (x + k + c_i)^3,$$
where $r$ is the number of rounds and the $c_i \in K$ are round constants with $c_0 = 0$. For our application, we consider the prime-field variant of MiMC, for which $K$ can be chosen as any prime field $\mathbb{F}_p$ such that $\gcd(3, p-1) = 1$ (so that $x \mapsto x^3$ is a permutation of $\mathbb{F}_p$). Since we wish to apply the GJR+ scheme based on the NTT (as in the original GJR scheme), the chosen field must further satisfy $p - 1 = \alpha\cdot(2n_{\max})$ for some odd integer $\alpha$ and some integer $n_{\max}$ which is a power of two. In practice, $n_{\max}$ is the maximum masking order which can be achieved by the GJR+ scheme. We hence choose a prime $p = \alpha\cdot 2^{\ell+1} + 1$ with $\gcd(\alpha, 3) = 1$ and $\ell = \log_2 n_{\max}$. Specifically, for a given target field size $\lambda = \log_2 p$, we search for the greatest integer $\ell$ and the smallest integer $\alpha$ such that: (i) $3 \nmid \alpha$, (ii) $\log_2\alpha + \ell + 1 < \lambda$, and (iii) $p = \alpha\cdot 2^{\ell+1} + 1$ is prime. For our application, we thus instantiate MiMC with such 128-bit and 256-bit prime fields:
• for $\lambda = 128$, we get $p = 407\cdot 2^{119} + 1$, giving $n_{\max} = 2^{118}$;
• for $\lambda = 256$, we get $p = 467\cdot 2^{247} + 1$, giving $n_{\max} = 2^{246}$.
Algorithm 5 gives a masked description of MiMC based on any standard circuit compiler. For the sake of clarity, we omit the application of the refresh gadget $G^R_n$ to the output of an arithmetic gadget $G^{\oplus}_n$ or $G^{\otimes}_n$ and consider that the refresh is part of these gadgets when necessary. For $\lambda = 128$ (resp. $\lambda = 256$), the number of rounds is $r = 81$ (resp. $r = 162$).
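The parameter constraints and the role of $\gcd(3, p-1) = 1$ can be illustrated with a small sketch. It checks conditions (i) and (ii), the bit length and $\gcd(3, p-1) = 1$ for the two primes above, but not their primality (condition (iii), which would need e.g. a Miller-Rabin test). It recomputes $r$ with the usual MiMC round count $r = \lceil \lambda / \log_2 3 \rceil$ (an assumption here, consistent with the values above), and it runs a toy, unmasked MiMC over $\mathbb{F}_{65537}$ with hypothetical round constants, decrypting through the cube-root exponent $3^{-1} \bmod (p-1)$.

```python
import math


def mimc_rounds(lam):
    """Usual MiMC round count for a lam-bit field: ceil(lam / log2(3))."""
    return math.ceil(lam / math.log2(3))


def field_form_ok(alpha, ell, lam):
    """Conditions (i), (ii), the bit length and gcd(3, p - 1) = 1 for
    p = alpha * 2^(ell + 1) + 1. Primality (condition (iii)) is NOT checked."""
    p = alpha * 2 ** (ell + 1) + 1
    return (alpha % 3 != 0
            and math.log2(alpha) + ell + 1 < lam
            and p.bit_length() == lam
            and math.gcd(3, p - 1) == 1)


# Toy, unmasked MiMC over a small prime of the same shape (assumption):
# P = 1 * 2^16 + 1 is prime and gcd(3, P - 1) = 1.
P = 65537
R = mimc_rounds(16)
CONSTS = [0] + [(7 * i * i + 3) % P for i in range(1, R)]   # hypothetical constants, c_0 = 0


def mimc_encrypt(x, k):
    for c in CONSTS:
        x = pow((x + k + c) % P, 3, P)      # F_i(x) = (x + k + c_i)^3
    return (x + k) % P                      # final key addition


def mimc_decrypt(y, k):
    d = pow(3, -1, P - 1)                   # cube-root exponent, exists since gcd(3, P - 1) = 1
    x = (y - k) % P
    for c in reversed(CONSTS):
        x = (pow(x, d, P) - k - c) % P      # undo one round
    return x


assert field_form_ok(407, 118, 128) and field_form_ok(467, 246, 256)
assert mimc_decrypt(mimc_encrypt(42, 12345), 12345) == 42
```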

Algorithm 5 Masked MiMC
Require: $n$-sharing $\mathbf{x}$ of plaintext $x \in \mathbb{F}_p$, $n$-sharing $\mathbf{k}$ of secret key $k \in \mathbb{F}_p$
Ensure: $n$-sharing of $\mathrm{MiMC}(x, k)$
1: for $i = 1, \dots, r$ do
2:
5: $\mathbf{x} \leftarrow G^{\otimes}_n(\mathbf{x}, \mathbf{y})$
6: $\mathbf{x} \leftarrow G^{\oplus}_n(\mathbf{x}, \mathbf{k})$
7: return $\mathbf{x}$

Table 6 and Table 7 summarize the operation counts for full MiMC (with $\lambda = 128$) and full AES with the two masking schemes ISW+ and GJR+ implemented over the same finite field. They show that our approach results in a 62% decrease in randomness complexity and a 51% decrease in the number of multiplications for MiMC masked at order 128, and in a 46% (resp. 59%) decrease in randomness complexity and a 20% (resp. 52%) decrease in the number of multiplications for AES masked at order 64 (resp. 128). For AES, the masking scheme ISW+ can always be implemented over the finite field $\mathbb{F}_{256}$. However, to achieve provable region probing security without relying on Hypothesis 1, Corollary 1 requires the masking scheme GJR+ to be implemented over a larger finite field such as $\mathbb{F}_{2^{128}}$ (which has less efficient arithmetic). To compare the complexities of GJR+ and ISW+, we can implement $\mathbb{F}_{2^{128}}$ as a degree-16 extension of $\mathbb{F}_{2^8} = \mathbb{F}_{256}$, so that: (1) an addition over $\mathbb{F}_{2^{128}}$ takes 16 additions over $\mathbb{F}_{256}$, (2) a multiplication over $\mathbb{F}_{2^{128}}$ takes 81 multiplications (and a large number of additions) over $\mathbb{F}_{256}$ using Karatsuba's algorithm (since a multiplication of two polynomials of degree less than $16 = 2^4$ over $\mathbb{F}_{256}$ requires $81 = 3^4$ multiplications, see [43, Section 8.1]), and (3) a random element of $\mathbb{F}_{2^{128}}$ requires 16 random elements of $\mathbb{F}_{256}$. The computational efficiency of GJR+ for AES is then only better than that of ISW+ for masking orders $n \ge 8192$, and its randomness complexity is better for $n \ge 2048$.
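The count of $81 = 3^4$ multiplications can be reproduced with a plain recursive Karatsuba multiplication that counts coefficient multiplications. This sketch uses integer coefficients as a stand-in for $\mathbb{F}_{256}$ elements (the number of multiplications does not depend on the coefficient ring) and only covers the product of two 16-coefficient polynomials, not the reduction modulo the degree-16 polynomial defining $\mathbb{F}_{2^{128}}$ over $\mathbb{F}_{256}$.

```python
def karatsuba(f, g, counter):
    """Karatsuba product of two coefficient lists of equal power-of-two length.

    counter[0] is incremented once per multiplication in the coefficient ring.
    """
    n = len(f)
    if n == 1:
        counter[0] += 1
        return [f[0] * g[0]]
    h = n // 2
    f0, f1, g0, g1 = f[:h], f[h:], g[:h], g[h:]
    low = karatsuba(f0, g0, counter)                        # f0 * g0
    high = karatsuba(f1, g1, counter)                       # f1 * g1
    mid = karatsuba([a + b for a, b in zip(f0, f1)],
                    [a + b for a, b in zip(g0, g1)], counter)
    mid = [m - a - b for m, a, b in zip(mid, low, high)]    # f0*g1 + f1*g0
    out = [0] * (2 * n - 1)
    for i, c in enumerate(low):
        out[i] += c
    for i, c in enumerate(mid):
        out[i + h] += c
    for i, c in enumerate(high):
        out[i + n] += c
    return out


def schoolbook(f, g):
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out


counter = [0]
f = list(range(1, 17))     # 16 coefficients, i.e. degree < 16
g = list(range(17, 33))
assert karatsuba(f, g, counter) == schoolbook(f, g)
print(counter[0], "coefficient multiplications")
```

Each recursion level replaces one product of size $n$ by three products of size $n/2$, hence $3^4 = 81$ base multiplications for $n = 2^4$, instead of $16^2 = 256$ for the schoolbook method.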

Performances and Comparison
with probability at least $1 - \varepsilon$ over the random sampling of $W$. A circuit compiler (Compile, Encode, Decode) is $(p, \varepsilon)$-random probing secure if for every circuit $C$ the compiled circuit $\widehat{C} = \mathsf{Compile}(C)$ is $(p, \varepsilon)$-random probing secure w.r.t. Encode (where $p$ and $\varepsilon$ might be functions of the encoding order and the circuit size).
Using the classical Chernoff bound, it is easy to prove that security in the region probing model implies security in the random probing model with appropriate parameters.
Proof. More generally, let $\delta \ge 1$ and suppose that $p \le r/(1+\delta)$. Let $W$ be a set of wires of $\widehat{C}$ where each wire of $\widehat{C}$ belongs to $W$ independently with probability $p$.

Noisy Leakage Model. The noisy leakage model was first formalized by Prouff and Rivain [39]. In this model, the adversary may learn information about every single wire; however, instead of learning exactly the value $x$ of a wire (as in the probing model), the adversary learns a randomized function $f(x)$ of $x$. The generality of the noisy leakage model allows it to encompass several real-life instances of leakage, making it a very realistic leakage model. However, due to its somewhat analytical nature, security proofs are notoriously hard to carry out in it. Fortunately, it has been shown that the noisy leakage and random probing models are equivalent: one implication was proven in [25] and improved in [38], and the other one was proven in [38]. As a workaround to the complexity of the noisy leakage model, one can therefore establish security proofs in the random probing model and subsequently transfer them to the noisy leakage model using the equivalence between the two. Proposition 2 provides an additional tool for this proof strategy and, combined with the results of [25,38], implies that a compiler secure in the region probing model is secure in the noisy leakage model.

Security of the GJR+ scheme. We have the following corollary of Theorem 3.

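Corollary 2. If the FFT circuit is linear and the refresh gadget satisfies the $t^R_n$-IOS property, the GJR+ compiler is $(p_n, \varepsilon_n)$-random probing secure with $p_n = r_n/2$, where $r_n$ is the probing rate defined in Theorem 3 for $t^{\mathrm{FFT}}_n = n-1$, and
$$\varepsilon_n = (2N_\oplus + 4N_\otimes)\cdot\exp(-n\cdot p_n) + 3N_\otimes\cdot\frac{n}{|\mathbb{F}|}, \tag{20}$$
where $N_\oplus$ (resp. $N_\otimes$) is the number of addition gates (resp. multiplication gates) in the original circuit.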
Proof. Let $C$ be an arithmetic circuit composed of $N_\oplus$ addition (or linear) gates and $N_\otimes$ multiplication gates, and let $\widehat{C}$ be the corresponding compiled circuit output by the GJR+ compiler. We split $\widehat{C}$ into regions following the region probing security reduction of the GJR+ scheme. Specifically, each addition (or sharewise) gadget and each refresh gadget forms a single region, while each multiplication gadget gives rise to three different regions: the first block (i.e., the circuit considered in Lemma 1), the internal refresh, and the second block (i.e., the circuit considered in Lemma 2). We hence get a total of $N_{\mathrm{reg}} = 2N_\oplus + 4N_\otimes$ regions in $\widehat{C}$ (counting the refresh gadgets). In the random probing model, each wire leaks independently of the other wires with a given probability, denoted $p_n$ here. We show hereafter that, with overwhelming probability over the distribution of the indexes of the leaking wires, the full random probing leakage can be perfectly simulated. More specifically, we exhibit two failure events $F_1$ and $F_2$ that may prevent such a perfect simulation. Whenever none of these failure events occurs, a perfect simulation is achieved in the same way as in the region probing security reduction given above. The first failure event $F_1$ occurs when the number $X_R$ of leaking wires in at least one region $R$ exceeds the threshold $2p_n|R|$, where $|R|$ denotes the number of wires in $R$. Applying the Chernoff bound, this occurs in a given region with probability
$$\Pr\big[X_R > 2p_n|R|\big] \le \exp\Big(-\frac{p_n|R|}{3}\Big) \le \exp(-n\cdot p_n), \tag{21}$$
where the last inequality holds since the minimal value of $|R|$ is obtained for the addition gadget with $|R| = 3n$. We deduce that the first failure event occurs with probability
$$\Pr(F_1) \le N_{\mathrm{reg}}\cdot\exp(-n\cdot p_n). \tag{22}$$
Provided that the first failure event does not occur, the random probing simulator needs to simulate at most $2p_n|R| = r_n|R|$ wires per region, where $r_n = 2p_n$ is as defined in Theorem 3 for $t^{\mathrm{FFT}}_n = n-1$. Through the above security reduction, this translates to simulating at most $t^{\mathrm{FFT}}_n = n-1$ probes in each FFT circuit. The second failure event $F_2$ occurs whenever the $n-1$ leaking wires within an FFT circuit cannot be perfectly simulated. Using Lemma 2 from [29] (see Appendix C) and thanks to the linearity of the FFT circuit, for any choice of $n-1$ leaking wires from the FFT circuit, the probability that the leaking wires cannot be simulated is lower than $n/|\mathbb{F}|$. Besides the linearity of the FFT circuit, the only requirement for this upper bound to apply is that the leaking wires are chosen independently of $\omega$, which holds in the random probing model since the leaking wires are drawn at random (each wire leaking with probability $p_n$) independently of the random generation of $\omega$. We deduce that the second failure event $F_2$ occurs with probability
$$\Pr(F_2) \le 3N_\otimes\cdot\frac{n}{|\mathbb{F}|}. \tag{23}$$
Whenever no failure event occurs, the leaking wires within each FFT circuit can be perfectly simulated, which, following the above reduction, implies that all the leaking wires can be perfectly simulated. It follows that the GJR+ scheme is $(p_n, \varepsilon_n)$-random probing secure with $\varepsilon_n \le \Pr(F_1 \vee F_2) \le \Pr(F_1) + \Pr(F_2)$, which together with Equation 22 and Equation 23 concludes the proof.
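A quick numerical check of the bound used for $F_1$, with hypothetical toy values of $n$, $|R| \ge 3n$ and $p_n$: for $X \sim \mathrm{Binomial}(|R|, p_n)$, the sketch compares the exact probability $\Pr[X \ge 2p_n|R|]$ (which upper-bounds $\Pr[X > 2p_n|R|]$) with the Chernoff bound $\exp(-p_n|R|/3)$ and with $\exp(-n\cdot p_n)$.

```python
import math


def binomial_tail(size, p, k_min):
    """Exact Pr[X >= k_min] for X ~ Binomial(size, p)."""
    return sum(math.comb(size, k) * p**k * (1 - p) ** (size - k)
               for k in range(k_min, size + 1))


def check_region(n, region_size, p):
    """Exact overflow probability of one region versus the bounds used for F1."""
    assert region_size >= 3 * n                  # smallest region: addition gadget, |R| = 3n
    mu = p * region_size                         # expected number of leaking wires
    exact = binomial_tail(region_size, p, math.ceil(2 * mu))   # Pr[X >= 2 mu]
    chernoff = math.exp(-mu / 3)                 # Chernoff bound with delta = 1
    final = math.exp(-n * p)                     # bound used in Pr(F1)
    return exact, chernoff, final


for n, size, p in [(16, 52, 0.05), (16, 210, 0.02), (64, 250, 0.01)]:
    exact, chernoff, final = check_region(n, size, p)
    print(f"n={n} |R|={size} p={p}: {exact:.3e} <= {chernoff:.3e} <= {final:.3e}")
```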
In the above random probing security proof, the tolerated number of probes in an FFT circuit is $t^{\mathrm{FFT}}_n = n-1$, which implies $\gamma \approx 1$ in Equation 11. On the other hand, our refresh gadget is such that $\beta = 3$. Assuming an FFT algorithm satisfying $|\mathrm{FFT}_n| = \alpha\cdot n\log n$, we finally get a leakage probability $p_n = O(1/\log n)$. For instance, the NTT algorithm used in the original GJR scheme satisfies $\alpha \approx 6$, which gives $p_n \approx \frac{1}{60\log n}$.

C Lemmas from [29]
We recall hereafter the key lemmas from [29] for the security of the GJR scheme. Let $\mathrm{FFT}_n$ be a linear FFT circuit on field $K$ taking an $\omega$-encoding $(a_1, \dots, a_n)$ as input. Every value $v$ taken by a wire of $\mathrm{FFT}_n$ can be expressed as
$$v = \sum_{i=0}^{n-1} \alpha_i\, a_{i+1},$$
where the $\alpha_i$'s are constant coefficients over $K$. The lemmas use the notation $[v] = (\alpha_0, \alpha_1, \dots, \alpha_{n-1})^T$ for the column vector of coefficients of such a wire value. Similarly, we denote $[a] = (1, \omega, \omega^2, \dots, \omega^{n-1})^T$ for an $\omega$-encoding $(a_1, \dots, a_n)$ of a variable $a$, since we have
$$a = \sum_{i=1}^{n} \omega^{i-1} a_i = [a]^T \cdot (a_1, \dots, a_n)^T .$$
In what follows, $\mathrm{span}(\cdot)$ refers to the linear span of the input matrix.
Lemma 4 (Lemma 2 of [29]). Let $\omega$ be a uniform random element of $K^*$, and let $v_1, v_2, \dots, v_\ell$ be a set of $\ell < n$ intermediate variables of $\mathrm{FFT}_n$ on input an $\omega$-encoding of a variable $a$. We have:
$$\Pr\big[\,[a] \in \mathrm{span}([v_1], \dots, [v_\ell])\,\big] < \frac{n}{|K|},$$
where the above probability is taken over a uniform random choice of $\omega$.
From these two lemmas, the values taken by any set of $\ell < n$ wires of $\mathrm{FFT}_n$ can be perfectly simulated without knowledge of $a$. The simulation simply consists in picking a random $a$, sampling a random $\omega$-encoding of $a$, and evaluating the wires $v_1, \dots, v_\ell$ accordingly. According to Lemma 2 of [29], such a simulation fails with probability lower than $n/|K|$.
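The counting argument behind the bound can be checked exhaustively on a toy field. For fixed vectors $[v_1], \dots, [v_\ell]$ with $\ell < n$, a nonzero vector orthogonal to their span turns the event $[a] \in \mathrm{span}([v_1], \dots, [v_\ell])$ into a nonzero polynomial equation of degree at most $n-1$ in $\omega$, so at most $n-1$ values of $\omega$ are bad. This is a sketch of that standard root-counting argument under our own assumptions, not the proof of [29]; the prime $101$, $n = 6$, $\ell = 5$ and the random placeholder vectors are hypothetical.

```python
import random


def rank_mod_p(rows, p):
    """Rank of a matrix over F_p (given as a list of rows), by Gaussian elimination."""
    rows = [list(r) for r in rows]
    ncols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)
        rows[rank] = [x * inv % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % p:
                factor = rows[i][col]
                rows[i] = [(x - factor * y) % p for x, y in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def count_bad_omegas(wire_vectors, n, p):
    """Number of omega in F_p^* such that [a] = (1, omega, ..., omega^(n-1))
    lies in span([v_1], ..., [v_l])."""
    base_rank = rank_mod_p(wire_vectors, p)
    bad = 0
    for omega in range(1, p):
        a_vec = [pow(omega, i, p) for i in range(n)]
        if rank_mod_p(wire_vectors + [a_vec], p) == base_rank:
            bad += 1
    return bad


P, N, L = 101, 6, 5          # toy field size, sharing size, number of probes (L < N)
rng = random.Random(2024)
# Placeholder coefficient vectors [v_j]; in [29] they are fixed by the FFT circuit,
# and in any case they must not depend on omega.
wires = [[rng.randrange(P) for _ in range(N)] for _ in range(L)]
bad = count_bad_omegas(wires, N, P)
print(f"{bad} bad omegas out of {P - 1}; n/|K| = {N / P:.4f}")
```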

D.1 Algorithms used in the Gao-Mateer additive FFT
This section presents the detailed algorithms used in the Gao-Mateer additive FFT.