Second-Order SCA Security with almost no Fresh Randomness

Abstract. Masking schemes are among the most popular countermeasures against Side-Channel Analysis (SCA) attacks. Realization of masked implementations on hardware faces several difficulties including dealing with glitches. Threshold Implementation (TI) is known as the first strategy with provable security in presence of glitches. In addition to the desired security order d, TI defines the minimum number of shares to also depend on the algebraic degree of the target function. This may lead to unaffordable implementation costs for higher orders. For example, at least five shares are required to protect the smallest nonlinear function against second-order attacks. By cutting such a dependency, the successor schemes are able to achieve the same security level by just d + 1 shares, at the cost of high demand for fresh randomness, particularly at higher orders. In this work, we provide a methodology to realize the second-order glitch-extended probing-secure implementation of a group of quadratic functions with three shares and no fresh randomness. This allows us to construct second-order secure implementations of several cryptographic primitives with very limited number of fresh masks, including Keccak, SKINNY, Midori, PRESENT, and PRINCE.


Introduction
Physical attacks are a serious threat for many security-critical devices, where the attacker tries to gain sensitive information by monitoring the physical properties of the target device. Physical observations like power consumption or electromagnetic radiation can be exploited to recover sensitive data if no proper countermeasure is employed. Hence, resistance against Side-Channel Analysis (SCA) attacks, as one kind of physical attack, is an important requirement for almost any deployed cryptographic device.
Kocher et al. [KJJ99] were the first to exploit the relation between the calculations performed on intermediate values and the power consumption of the target device. This seminal work opened a new line of research into SCA attacks, leading to more sophisticated attack strategies like Correlation Power Analysis (CPA) [BCO04], Mutual Information Analysis (MIA) [GBTP08], and Moments-Correlating DPA (MC-DPA) [MS16b]. Recently, several studies have focused on applying Deep Learning (DL) to improve state-of-the-art SCA attacks [Tim19, RAD20]. This highlights the necessity of employing appropriate countermeasures to ensure physical security in applications where the adversary has a chance to control the device. A wide range of strategies and theories have been proposed in the open literature to limit or eliminate SCA leakage, some focusing on software implementations, others on hardware platforms. Masking schemes, due to their sound theoretical basis and a good understanding of their requirements, are among the most common methods applied in practice and studied by researchers. In masking schemes, sensitive intermediate values are randomized during the execution of the cipher, breaking the relation between the processed secret-dependent data and the physical properties of the underlying device.
Generally, in a masking scheme, every key-dependent variable is split into several shares (defining the order of sharing/masking), and all computations are performed on the masked data, which can be seen as performing the computations on secret-shared data. Boolean masking is surely the most popular approach, although other masking schemes like Multiplicative masking [MRB18] or Inner Product masking [BFG+17] can be beneficial depending on the application. It has been shown that masking increases the measurement complexity of an SCA attack exponentially in the number of shares, provided that the leakage of each share is noisy enough and that each power sample depends on a bounded number of shares. This level of protection does not come for free; the area overhead and the latency of an implementation realizing a masking scheme grow approximately quadratically with respect to the number of shares [FGP+18].
In the context of masking, the seminal contribution was made by Trichina [Tri03], where a first-order secure AND gate was presented. As a follow-up work, Ishai et al. [ISW03] introduced a general methodology to mask a 2-input AND gate at the desired security order. However, hardware implementations of these schemes exhibit leakage due to a well-known phenomenon in hardware platforms called glitches [MPO05, MME10]. Glitches are unwanted signal transitions at the output of a combinatorial circuit due to the unbalanced delay of its inputs. Hence, the result of the calculation also depends on the timing of the inputs and can potentially cause exploitable leakage in practice.
To assess the security of a given design, some sort of abstraction should be made. The most convincing approach in this context is the probing security model, first introduced in [ISW03]. In this model, the security order of the design is defined by the maximum number of probes that the attacker can place on intermediate signals of the circuit and observe their values simultaneously. It appears consistent with software implementations, where the instructions are executed sequentially and each of them can be considered as an atomic gate. However, this abstraction does not consider physical defaults, and hence designs may exhibit leakage even though they are shown secure under the probing model. Over time, extensive research has been devoted to understanding how to adjust this model to cover hardware platforms. Among the several models introduced, the most convincing one seems to be the robust probing model presented in [FGP+18]. In order to take glitches into account, each probe in the glitch-extended probing model is extended to multiple probes, implying an even stronger adversary. By probing a signal in a combinatorial circuit, the adversary gains information about all intermediate values and input signals involved in the calculation of the probed signal.
A critical question more than a decade ago was how to make a design secure considering physical defaults like glitches. In order to properly address this question, three comprehensible properties, associated with an implementation strategy called Threshold Implementation (TI), have been introduced in [NRR06]. This strategy is immune to glitches and guarantees the security of the design if all properties are fulfilled. In the underlying methodology, the number of shares is defined based on the desired security order and the algebraic degree of the function. It has been shown that the same level of security can also be achieved with the minimum possible number of shares, i.e., independent of the algebraic degree of the underlying function [RBN+15, GMK16], even in the presence of glitches. In this technique, the masked realization is split into two parts, where registers should be placed in between to avoid the propagation of glitches, and fresh randomness should be used to avoid leakage. In [SM20], a methodology is presented which avoids using fresh randomness in first-order secure hardware implementations with two shares. The authors provided a first-order secure 2-input AND gate without fresh randomness for the first time and presented first-order secure implementations of a couple of ciphers with no fresh randomness under the glitch-extended probing model. The situation is a bit different at higher orders. Classical TI forces the use of at least five shares to protect the smallest non-linear function against second-order attacks [BGN+14]. In contrast, when using the minimum of three shares for such a security level, the use of a relatively high number of fresh masks is mandatory. For example, a second-order masked AES S-box requires 162 fresh mask bits [CRB+16] (alternatively 84 bits [GMK17]).

Our Contributions
In this work, we go a step further and introduce three-share hardware constructions which can provide second-order security without fresh randomness. Due to the complexity of the algorithms as well as the constructions, we limit ourselves to quadratic functions, i.e., with algebraic degree of two. In short, we present a group of quadratic functions whose three-share second-order secure hardware implementation can be realized without fresh randomness. Functions outside this group need to be decomposed into such quadratic functions, which necessitates refreshing the intermediate shares to avoid multivariate leakage. As an outcome of our research, we provide three-share second-order glitch-extended probing-secure implementations of Keccak with no fresh randomness and of the S-boxes of the SKINNY, Midori, PRESENT, and PRINCE ciphers using only 8-bit fresh masks per clock cycle. We would like to highlight that we confirm the second-order security of our constructions by SILVER [KSM20] under the glitch-extended probing model and by FPGA-based practical experiments. Our programs as well as the hardware implementations (HDL codes) are fully provided on GitHub.

Background
In the following, after giving the used notations and basic definitions, we review the concept behind probing and glitch-extended probing security and restate the fundamental concepts of hardware masking, which are required to follow the rest of the paper.

Notations and Definitions
We denote binary variables $\in \mathbb{F}_2$ with lower-case italic $x$ and vectors $\in \mathbb{F}_2^{n>1}$ with upper-case italic $X$. We represent the $j$-th element of a vector $X$ with superscripts $x^j$, the $i$-th share of a variable with subscripts $x_i$, coordinate functions with lower-case italic sans-serif $\mathsf{f}(\cdot)$, vectorial Boolean functions with upper-case italic sans-serif $\mathsf{F}(\cdot)$, and sets with calligraphic font $\mathcal{F}$.
A Boolean function of $n$ variables is a function of the form $f : \mathbb{F}_2^n \to \mathbb{F}_2$, where $\mathbb{F}_2^n$ is the $n$-dimensional vector space over $\mathbb{F}_2$. We denote $F : \mathbb{F}_2^n \to \mathbb{F}_2^m$ to show a vector of Boolean functions.

Definition 2.
A Boolean function $f : \mathbb{F}_2^n \to \mathbb{F}_2$ is balanced if its output yields as many zeros as ones over its input set.
In this paper, we use Algebraic Normal Form (ANF), which is a representation of a Boolean function with a polynomial of $n$ variables of $X = (x^1, \ldots, x^n)$. In other words, every Boolean function $f : \mathbb{F}_2^n \to \mathbb{F}_2$ can uniquely be expressed by an element in $\mathbb{F}_2[x^1, \ldots, x^n]$, where $\mathbb{F}_2[x^1, \ldots, x^n]$ is the ring of all polynomials with coefficients in $\mathbb{F}_2$.
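Definition 3 (Algebraic degree of a Boolean function). The algebraic degree of a Boolean function $f : \mathbb{F}_2^n \to \mathbb{F}_2$, written in ANF as $f(X) = \sum_{V \in \mathbb{F}_2^n} \alpha_V \prod_{i=1}^{n} (x^i)^{v^i}$, is defined as:
$$\deg(f) = \max \{ \mathrm{HW}(V) \mid \alpha_V = 1 \},$$
where $\forall V, \alpha_V \in \mathbb{F}_2$, $\mathrm{HW}(V)$ denotes the Hamming weight of $V$, and by $v^i$ we refer to the $i$-th element of $V$. In other words, the maximum number of variables that have to be multiplied determines the algebraic degree of a given Boolean function. Further, the algebraic degree of a vectorial Boolean function is determined by the maximum algebraic degree of its coordinate functions.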
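The definition can be made concrete with a small sketch that derives the ANF coefficients from a truth table and reads off the degree. The function names and the truth-table indexing (bit i of the index is the (i+1)-th variable) are illustrative assumptions, and the zero function is assigned degree 0 here purely for convenience.

```python
def anf(truth_table):
    """Binary Moebius transform: truth table -> ANF coefficients.

    Index v encodes the input vector (bit i of v is the (i+1)-th variable);
    the returned coeffs[v] is the coefficient of the monomial selected by v.
    """
    coeffs = list(truth_table)
    n = len(coeffs).bit_length() - 1
    for i in range(n):
        for v in range(len(coeffs)):
            if (v >> i) & 1:
                coeffs[v] ^= coeffs[v ^ (1 << i)]
    return coeffs


def algebraic_degree(truth_table):
    """Largest Hamming weight of an exponent vector with a nonzero coefficient."""
    weights = [bin(v).count("1") for v, c in enumerate(anf(truth_table)) if c]
    return max(weights, default=0)  # assumption: report 0 for the zero function


AND_TT = [0, 0, 0, 1]  # x1 * x2
XOR_TT = [0, 1, 1, 0]  # x1 + x2
# x1 * x2 + x3, the AND-XOR function
AND_XOR_TT = [((v & 1) & ((v >> 1) & 1)) ^ ((v >> 2) & 1) for v in range(8)]

print(algebraic_degree(AND_TT), algebraic_degree(XOR_TT), algebraic_degree(AND_XOR_TT))
```

Both the AND and the AND-XOR function are quadratic in this sense, while XOR is linear.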
In order to apply Boolean masking to provide $s$-th order security, we first split a variable $x$ into at least $s + 1$ shares $x_i$, where $i \in \{0, 1, \ldots, s\}$, such that the sum of these shares over $\mathbb{F}_2$ is equal to the original value, i.e., $x = \sum_{\forall i} x_i$. As an initial sharing, we naturally can draw $x_1$ to $x_s$ from a uniform distribution at random and form the first share as $x_0 = x + x_1 + \cdots + x_s$. The application of Boolean masking on linear functions is straightforward, since the same function can be applied on each share independently. However, implementing the masked realization of a non-linear Boolean function is non-trivial and special care should be taken to avoid any leakage. This is actually the main difficulty and the core of major publications in the area of Boolean masking.
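As a minimal sketch of the initial sharing and of share-wise evaluation of a linear function, the following uses single bits and Python's `secrets` module as the randomness source. The helper names `share`, `unshare`, and `masked_xor` are hypothetical and only serve the illustration.

```python
import secrets
from functools import reduce
from operator import xor


def share(x, s):
    """Split bit x into s + 1 Boolean shares (x_0, ..., x_s)."""
    rest = [secrets.randbits(1) for _ in range(s)]  # x_1 .. x_s drawn uniformly
    x0 = reduce(xor, rest, x)                       # x_0 = x + x_1 + ... + x_s
    return [x0] + rest


def unshare(shares):
    """Recombine the shares by summing them over F_2."""
    return reduce(xor, shares, 0)


def masked_xor(xs, ys):
    """A linear function (here XOR) is applied to each share independently."""
    return [xi ^ yi for xi, yi in zip(xs, ys)]


a_shares, b_shares = share(1, 2), share(0, 2)
print(unshare(masked_xor(a_shares, b_shares)))  # recombines to 1 ^ 0
```

A non-linear function such as AND cannot be handled this way, because its product terms mix shares with different indices.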

Probing Security
Masking, as the most promising countermeasure against SCA attacks, has been widely applied in practice. Consequently, many different schemes have been proposed over the years, considering different applications, assumptions and security requirements [Tri03, ISW03, NRR06, RBN+15, GMK16, NRS11, GM18, GIB18]. Some of the interesting questions in this context are how to evaluate the proposed masking schemes under different adversary models, and how to consider execution environments and physical defaults in the evaluation process.
One of the first attempts to address such questions was presented in [ISW03], where the d-probing security model was proposed. In this model, the adversary is allowed to observe (probe) up to d intermediate values during the execution of the cipher. It has been repeatedly shown (e.g., in [MPO05, MME10]) that hardware implementations of such d-probing secure schemes fail to deliver security in practice. Namely, this model fits best to software platforms, where instructions can be viewed as atomic gates and there is no data-dependent activation timing. However, it is an inaccurate assumption for hardware platforms due to a common phenomenon in CMOS technologies called glitches. In fact, the d-probing model does not cover specific physical defaults, such as glitches, couplings, or transitions [FGP+18]. These undesired effects, which are inherent to the nature of physical implementations, may occur during the execution of the cipher on a device, hence violating the security assumptions and leading to exploitable leakage.
To consider physical characteristics in the verification model, the relevant scientific communities have conducted extensive research on the development of formal models. After some trial and error, Faust et al. [FGP+18] addressed the aforesaid questions by proposing an extension of the d-probing model called the robust probing model, which can cover inherent physical properties of hardware platforms when evaluating the security of a design. Focusing on the glitch-extended feature of such a comprehensive model, each probe placed on a combinatorial circuit propagates backward to the last synchronization point (registers). In other words, by placing a single probe, the adversary has information about all signals that may contribute to determining the value of the probed signal. This simple but effective abstraction significantly helped in reducing the implementation cost of several schemes [BGR18, SM20]. Moreover, using such a glitch-extended probing model, the authors of [MMSS19] demonstrated the insecurity of previous hardware-oriented masking schemes such as [RBN+15, GMK17, GMK16]. This highlighted the importance and necessity of probing security proofs for masked implementations, leading to the development of formal verification tools to evaluate given designs [BBC+19, KSM20].
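The backward propagation of a glitch-extended probe can be sketched on a toy netlist. The netlist encoding (a dictionary mapping each signal to its kind and fan-in), the signal names, and the function `extended_probe` are assumptions made for this illustration and are not taken from any verification tool.

```python
def extended_probe(netlist, signal):
    """Stable signals (primary inputs or register outputs) that a
    glitch-extended probe on `signal` reveals to the adversary."""
    kind, fanin = netlist[signal]
    if kind in ("input", "reg"):
        return {signal}          # synchronization point: propagation stops here
    seen = set()
    for src in fanin:            # combinatorial gate: follow every input
        seen |= extended_probe(netlist, src)
    return seen


# Hypothetical fragment: two component functions, their registers, and one
# output of the compression layer.
netlist = {
    "a0": ("input", []), "b0": ("input", []), "b1": ("input", []), "r": ("input", []),
    "f0": ("comb", ["a0", "b0"]),        # e.g. a0 & b0
    "f1": ("comb", ["a0", "b1", "r"]),   # e.g. (a0 & b1) ^ r
    "q0": ("reg", ["f0"]),
    "q1": ("reg", ["f1"]),
    "x0": ("comb", ["q0", "q1"]),        # compression: q0 ^ q1
}

print(sorted(extended_probe(netlist, "x0")))  # register outputs feeding x0
print(sorted(extended_probe(netlist, "f1")))  # all inputs of the component function
```

A probe on the compression output stops at the registers, whereas a probe on a register input reaches every input of the corresponding component function.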

Masking with td + 1 Shares
Classical TI [NRR06] is the first implementation strategy that is immune to glitches in hardware implementations. It defines the minimum number of input shares as $td + 1$, where $t$ and $d$ stand for the algebraic degree of the function and the desired security order, respectively. Let us suppose that the number of input shares and output shares are the same and equal to $s + 1$. Hence, a masked realization of $Y = F(X)$ receives input shares $X_0, \ldots, X_s$ and provides output shares $Y_0, \ldots, Y_s$. Three essential properties were introduced in [NRS11] to guarantee first-order security. First of all, the sum of the output shares should yield the original output value for the sake of correctness, i.e., $Y = \sum_{i=0}^{s} Y_i$. In order to fulfill the second property, the computation of each output share should be independent of at least one input share. To this end, the authors suggested to avoid giving $X_i$ as an input to the function which generates the output share $Y_i$, a so-called component function. This property, called non-completeness, guarantees glitch resistance, as the leakage of each component function is independent of $X$. Put differently, by placing a glitch-extended probe on any component function, the adversary observes only a non-complete shared input, hence no information about $X$. The third and last property, uniformity, implies that for each value of $X$, all possible input shares $X_0, \ldots, X_s$ lead to a set of $Y_0, \ldots, Y_s$ which can be represented as $F(X) = Y$ being shared by masks uniformly selected at random. Note that uniformity itself is neither a necessary nor a sufficient condition to achieve security of an implementation [MBR19]. It actually becomes important when secure gadgets are composed. In other words, we need to guarantee that the second gadget receives a uniformly-shared input. Otherwise, the essential underlying assumption of masking (secret sharing) is violated.
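The three properties can be checked exhaustively on the textbook three-share direct sharing of the 2-input AND. The concrete assignment of monomials to output shares below is one common choice and an assumption of this sketch; non-completeness holds by construction, since the output share y_i never reads a_i or b_i.

```python
from itertools import product
from collections import Counter


def ti_and(a, b):
    """Three-share direct sharing of y = ab; y_i is independent of a_i and b_i."""
    y0 = (a[1] & b[1]) ^ (a[1] & b[2]) ^ (a[2] & b[1])
    y1 = (a[2] & b[2]) ^ (a[0] & b[2]) ^ (a[2] & b[0])
    y2 = (a[0] & b[0]) ^ (a[0] & b[1]) ^ (a[1] & b[0])
    return (y0, y1, y2)


def sharings(x):
    """All three-share Boolean sharings of the bit x."""
    return [(x ^ s1 ^ s2, s1, s2) for s1, s2 in product((0, 1), repeat=2)]


def output_distribution(a, b):
    return Counter(ti_and(sa, sb) for sa in sharings(a) for sb in sharings(b))


def is_correct():
    return all((y0 ^ y1 ^ y2) == (a & b)
               for a, b in product((0, 1), repeat=2)
               for (y0, y1, y2) in output_distribution(a, b))


def is_uniform():
    # 16 input sharings per (a, b) and 4 valid output sharings: 4 hits each
    return all(sorted(output_distribution(a, b).values()) == [4, 4, 4, 4]
               for a, b in product((0, 1), repeat=2))


print(is_correct(), is_uniform())
```

The sharing is correct and non-complete but not uniform, which illustrates why composing such gadgets requires extra care such as remasking.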
While non-completeness can be easily achieved by a methodology called direct sharing [NRS11], no systematic way is known to fulfill uniformity except remasking, i.e., refreshing the sharing of the output using fresh randomness. Daemen [Dae17] introduced a trick called changing of the guards to relax the need for fresh masks at every clock cycle. The underlying concept is based on re-using the unrelated parts of the cipher, e.g., shares of the neighboring S-box(es), as fresh masks. We should highlight that this technique can fulfill the uniformity of a correct and non-complete shared function, but cannot be beneficial if non-completeness is violated.
In short, constructing secure implementations becomes costly when the algebraic degree of the function increases or higher-order security is desired. To cope with this issue, the target function is decomposed into smaller functions, as their masked variants are easier to achieve. The secure constructions of the smaller functions are composed with register stages in between to avoid the propagation of glitches, or more precisely, to avoid the propagation of glitch-extended probes to all shares of a variable. TI also covers higher-order security [BGN+14], where the main difference to the first order is the adjusted definition of non-completeness. Considering probing security, in a d-th-order secure implementation, every d probes placed on any part of the circuit should be independent of at least one share of every variable. As highlighted in [Rep15], the output sharing of a higher-order secure gadget should be refreshed before being fed into the next higher-order secure gadget.

Masking with d + 1 Shares
The high implementation cost of TI circuits has been reported in several case studies (e.g., [MPL+11, CBR+15]), particularly at higher orders, due to the high number of shares, which naturally translates to a significant amount of fresh randomness when composing the functions. Two independent works [RBN+15, GMK16] made the number of shares independent of the algebraic degree of the function, i.e., d + 1 shares for d-th-order security. These constructions can potentially lead to lower implementation costs in terms of area overhead and latency while maintaining the same level of security and glitch resistance that TI offers. In these techniques, the masked variant of the target function consists of two separate parts, which are divided by dedicated registers. More importantly, fresh masks should be used to ensure the security of the gadgets. Precisely speaking, in contrast to td + 1, where fresh masks might be needed to fulfill uniformity when the functions are composed, in d + 1 the fresh masks are essential to achieve non-completeness. In other words, a standalone td + 1 function can be secure without any fresh masks, but its d + 1 variant demands fresh masks for its security.
Following Domain Oriented Masking (DOM) [GMK16], which needs slightly fewer fresh masks compared to [RBN+15] in certain scenarios, a two-share masked variant of a 2-input AND gate $x = f(a, b)$ can be realized as:
$$f_0 = a_0 b_0, \quad f_1 = a_0 b_1 + r, \quad f_2 = a_1 b_0 + r, \quad f_3 = a_1 b_1,$$
$$x_0 = f_0 + f_1, \quad x_1 = f_2 + f_3,$$
where $r$ is a single-bit fresh mask, $a_0, a_1, b_0, b_1$ are input shares, and $x_0, x_1$ are output shares. $f_l(\cdot)$, $0 \le l \le 3$, are known as component functions, whose results should be stored in registers before being compressed into $x_0$ and $x_1$. The part that XORs the registers' outputs to generate the output shares $x_0$ and $x_1$ is known as the compression layer. It is shown in [RBN+15] that the demand for fresh randomness can be relaxed further, particularly in the first-order masked implementation of quadratic functions. A methodology has been later introduced in [SM20] which avoids using fresh randomness in first-order d + 1 hardware implementations. We express the details of this scheme in the next section.
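To illustrate why the fresh mask matters, the sketch below exhaustively checks a two-share DOM-style AND gate (component functions a0b0, a0b1 + r, a1b0 + r, a1b1) under the glitch-extended model at first order. The probe set and the function names are assumptions of this illustration; it is a toy check and not a replacement for a formal verification tool.

```python
from itertools import product
from collections import Counter


def probe_views(a, b, use_mask=True):
    """Distribution of every single glitch-extended probe for secrets (a, b)."""
    views = Counter()
    masks = (0, 1) if use_mask else (0,)
    for a0, b0, r in product((0, 1), (0, 1), masks):
        a1, b1 = a ^ a0, b ^ b0
        f0, f1 = a0 & b0, (a0 & b1) ^ r
        f2, f3 = (a1 & b0) ^ r, a1 & b1
        assert ((f0 ^ f1) ^ (f2 ^ f3)) == (a & b)  # correctness of the sharing
        observations = {
            "x0": (f0, f1),      # probe on x0 propagates to both registers
            "x1": (f2, f3),
            "f0": (a0, b0),      # probe on a register input sees its inputs
            "f1": (a0, b1, r),
            "f2": (a1, b0, r),
            "f3": (a1, b1),
        }
        for name, seen in observations.items():
            views[(name, seen)] += 1
    return views


def first_order_secure(use_mask=True):
    reference = probe_views(0, 0, use_mask)
    return all(probe_views(a, b, use_mask) == reference
               for a, b in product((0, 1), repeat=2))


print(first_order_secure(use_mask=True), first_order_secure(use_mask=False))
```

With the mask, every probe view is identically distributed for all values of (a, b); with r fixed to zero, the extended probe on x0 already distinguishes b = 0 from b = 1.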

Technique
Below, we first briefly review the technique presented in [SM20], which allows constructing first-order secure implementations without any fresh randomness. Afterwards, we present our developments extending the underlying scheme to the second order.

First-Order d + 1 Masking with no Fresh Randomness
It is shown in [SM20] that a two-share first-order representation of the 2-input AND $x = f(a, b) = ab$ can be realized by four component functions $f_{0 \le l \le 3}(\cdot)$, each of which receives a combination of input shares as follows.
Its first-order security has been guaranteed through the following observations.
• Every component function receives only one share of each input, either $a_0$ or $a_1$ and either $b_0$ or $b_1$, hence fulfilling non-completeness. Therefore, placing a probe on any gate of a component function does not leak any information about $a$ and/or $b$.
• Following the glitch-extended probing model, a probe placed on $x_0$ propagates to $(x_0, x_1)$. However, simulating $(x_0, x_1)$ for all possible sharings of $(a_0, a_1, b_0, b_1)$ shows a unique joint distribution for all values of $(a, b)$. The same holds when a probe is placed on $x_1$. Further, it has been shown that $(x_0, x_1)$ is a uniform sharing of $x = f(a, b) = ab$ if $a$ and $b$ are uniformly shared. Note that the above-given example is one of 16 solutions found for the 2-input AND in [SM20].
The same principle has been extended to cover up to 4-bit cubic functions allowing the authors to obtain the first-order secure implementations of coordinate functions of several 4-bit S-boxes. As the last step to construct the masked S-box, a combination of different solutions for each coordinate function should be found that fulfills the joint uniformity of the output sharing.
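The flavor of such a search can be sketched for the two-share AND. The search space below is an assumption of this illustration: each component function consists of its quadratic monomial plus optional linear terms of its own two inputs, wired as x0 = f0(a0, b0) + f1(a0, b1) and x1 = f2(a1, b0) + f3(a1, b1). The candidate named `example` is a hypothetical solution derived for this sketch, not the equation given in [SM20], and the number of solutions depends on the assumed search space.

```python
from itertools import product
from collections import Counter

# (share index of a, share index of b) feeding f_0 .. f_3
WIRING = [(0, 0), (0, 1), (1, 0), (1, 1)]


def evaluate(cand, a_sh, b_sh):
    """cand[l] = (alpha, beta) encodes f_l = a_i b_j + alpha a_i + beta b_j."""
    return [(a_sh[i] & b_sh[j]) ^ (al & a_sh[i]) ^ (be & b_sh[j])
            for (i, j), (al, be) in zip(WIRING, cand)]


def is_valid(cand):
    reference = None
    for a, b in product((0, 1), repeat=2):
        views, x0_count = Counter(), Counter()
        for a0, b0 in product((0, 1), repeat=2):
            f = evaluate(cand, (a0, a ^ a0), (b0, b ^ b0))
            x0, x1 = f[0] ^ f[1], f[2] ^ f[3]
            if (x0 ^ x1) != (a & b):
                return False                   # incorrect sharing
            views[("x0", f[0], f[1])] += 1     # extended probe on x0
            views[("x1", f[2], f[3])] += 1     # extended probe on x1
            x0_count[x0] += 1
        if x0_count[0] != x0_count[1]:
            return False                       # output sharing not uniform
        if reference is None:
            reference = views
        elif views != reference:
            return False                       # probe view depends on (a, b)
    return True


candidates = product(product((0, 1), repeat=2), repeat=4)
solutions = [c for c in candidates if is_valid(c)]

# hypothetical example: f0 = a0b0 + b0, f1 = a0b1, f2 = a1b0 + b0, f3 = a1b1
example = ((0, 1), (0, 0), (0, 1), (0, 0))
print(len(solutions), example in solutions)
```

Probes on the component functions themselves need no explicit check here, because each of them reads only one share of a and one share of b.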

2-input AND
Moving toward second-order security, we start with the same simplest case, i.e., the 2-input AND. Using three shares, $a : (a_0, a_1, a_2)$ and $b : (b_0, b_1, b_2)$, 9 component functions $f_{0 \le l \le 8}(\cdot)$ are required to cover all 9 quadratic monomials $a_i b_j$, $0 \le i, j \le 2$. Naturally, the results of every three component functions should be compressed (after being stored in dedicated registers) to form an output share, as shown in the following example.
$$\begin{aligned} x_0 &= f_0(a_0, b_0) + f_1(a_0, b_1) + f_2(a_0, b_2), \\ x_1 &= f_3(a_1, b_0) + f_4(a_1, b_1) + f_5(a_1, b_2), \\ x_2 &= f_6(a_2, b_0) + f_7(a_2, b_1) + f_8(a_2, b_2). \end{aligned} \tag{4}$$
In addition to the corresponding quadratic monomial $a_i b_j$, each component function may contain linear monomials of its own inputs $a_i$ and $b_j$. Hence, we should search for cases whose combination fulfills the requirements for second-order security. Compared to [SM20], we need to extend the checks and examine all possible pairs of probes which can be placed on different parts of the implementation. To this end, we constructed the procedure shown in Algorithm 1 and Algorithm 2, which explain the entire process.
We start with constructing a set $\mathcal{F}_{0,1,2}$ containing 3 component functions $f_0$, $f_1$, and $f_2$ that are jointly second-order secure and whose compression layer's output is balanced. As defined in Section 2.1, a Boolean function is balanced if its output yields as many zeros as ones over its input set. This is shown in lines 4 to 13 of Algorithm 1. Its second-order security is examined by three combinations of two probes: one probe at the output of the compression layer, which (under the glitch-extended probing model) propagates backward to the outputs of the registers storing the outputs of all three component functions, and the second probe on the input of one of such registers, which also propagates back to all inputs of the component function. For example, considering Equation (4), placing a probe on $x_0$ propagates to $f_0(a_0, b_0)$, $f_1(a_0, b_1)$, and $f_2(a_0, b_2)$. If the second probe is placed on the first component function $f_0(a_0, b_0)$, which propagates to $a_0$ and $b_0$, the joint distribution of $(a_0, b_0, f_1(a_0, b_1), f_2(a_0, b_2))$ should be identical for all values of $(a, b)$ over all possible sharings. Note that it is not required to consider $f_0(a_0, b_0)$ in the joint distributions, as all its inputs $a_0$ and $b_0$ are already covered. This check and the other 2-probe combinations are shown in lines 6 to 8 of Algorithm 1. The final check is the balancedness of the compression layer's output, which is an essential condition for uniform sharing of the final construction [KSM20, § 4.6]. In the rest of Algorithm 1, this process is repeated to construct two other sets $\mathcal{F}_{3,4,5}$ and $\mathcal{F}_{6,7,8}$ for the other tuples of component functions, identified by lines 14 to 23 and lines 24 to 33, respectively. The next step, shown in Algorithm 2, is to find an element in each aforementioned set that jointly i) realize the sharing of $f(a, b) = ab$, ii) are second-order secure, and iii) form a uniform sharing for the output.
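The three 2-probe checks for one output share can be sketched as follows. The sketch covers only the first output share of the default configuration, and it assumes that each component function is its quadratic monomial plus optional linear terms of its own inputs. The `fresh` switch, which adds one independent hypothetical mask bit to each component function, exists only to confirm that the checker accepts something; it does not model a complete gadget.

```python
from itertools import product
from collections import Counter


def pair_distributions(cand, b, fresh):
    """For secret b, tabulate what the probe pairs (x_0, f_k), k = 0..2, observe.

    cand[k] = (alpha, beta) encodes f_k = a_0 b_k + alpha a_0 + beta b_k.
    The secret a cannot leak here because only the single share a_0 is involved.
    """
    dists = [Counter() for _ in range(3)]
    for a0, b0, b1 in product((0, 1), repeat=3):
        b_sh = (b0, b1, b ^ b0 ^ b1)
        for r in product((0, 1), repeat=3 if fresh else 0):
            f = tuple((a0 & b_sh[k]) ^ (al & a0) ^ (be & b_sh[k])
                      ^ (r[k] if fresh else 0)
                      for k, (al, be) in enumerate(cand))
            for k in range(3):
                # probe on x_0 sees all three registers, probe on f_k its inputs
                inputs = (a0, b_sh[k]) + ((r[k],) if fresh else ())
                dists[k][inputs + f] += 1
    return dists


def second_order_secure(cand, fresh=False):
    return pair_distributions(cand, 0, fresh) == pair_distributions(cand, 1, fresh)


candidates = list(product(product((0, 1), repeat=2), repeat=3))
survivors = [c for c in candidates if second_order_secure(c)]
print(len(candidates), len(survivors))
```

Under these assumptions no candidate survives, because each probe pair requires the b-linear terms of the two remaining component functions to differ, and three bits cannot be pairwise distinct.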
In order to ease the first check, i.e., the correctness of sharing, we store the function made by each output of the compression layer. For example, in line 10 of Algorithm 1, in addition to component functions $f_0$, $f_1$, and $f_2$, we store $f_{0,1,2}$ in set $\mathcal{F}_{0,1,2}$. For all elements of $\mathcal{F}_{0,1,2}$ and all elements of $\mathcal{F}_{3,4,5}$, we calculate the XOR of the corresponding outputs of the compression layer, i.e., $f_{0,1,2} + f_{3,4,5}$. Since the desired output of the target function over the input sharing, i.e., $f^*$, can be easily obtained by replacing $a$ by $a_0 + a_1 + a_2$ and $b$ by $b_0 + b_1 + b_2$ in $f(a, b) = ab$, the expected third output of the compression layer can be calculated as $f^*_{6,7,8} = f^* + f_{0,1,2} + f_{3,4,5}$. If $f^*_{6,7,8}$ exists in $\mathcal{F}_{6,7,8}$, we have found a solution that fulfills the correctness property. In order to accelerate this process, we enumerate the ANF of such functions and use sorted arrays (or sorted linked lists) to rapidly find out whether the expected function exists in a set.
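The correctness lookup can be sketched with ANFs stored as sets of monomials, so that addition over F_2 becomes a symmetric difference. The hash-based Python set stands in for the sorted arrays mentioned in the text, and the particular functions placed in the sets are hypothetical placeholders rather than outputs of Algorithm 1.

```python
def mono(*variables):
    """A monomial is the set of variables it multiplies."""
    return frozenset(variables)


def poly(*monomials):
    """An ANF is the set of monomials with coefficient 1."""
    return frozenset(monomials)


def add(*polys):
    """Addition over F_2: monomials appearing an even number of times cancel."""
    total = frozenset()
    for p in polys:
        total = total ^ p
    return total


# f* = (a0 + a1 + a2)(b0 + b1 + b2): all nine quadratic monomials
f_star = poly(*(mono(f"a{i}", f"b{j}") for i in range(3) for j in range(3)))

# hypothetical compression outputs taken from the first two sets
f012 = poly(mono("a0", "b0"), mono("a0", "b1"), mono("a0", "b2"), mono("b0"))
f345 = poly(mono("a1", "b0"), mono("a1", "b1"), mono("a1", "b2"), mono("b0"))

# hypothetical contents of the third set
F678 = {
    poly(mono("a2", "b0"), mono("a2", "b1"), mono("a2", "b2")),
    poly(mono("a2", "b0"), mono("a2", "b1"), mono("a2", "b2"), mono("a2")),
}

expected = add(f_star, f012, f345)   # f*_{6,7,8}
print(expected in F678)
```

Computing the expected third function once and testing membership avoids iterating over the third set for every pair taken from the first two.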
After finding a correct solution, we need to examine its second-order security. Since, through Algorithm 1, we included in each set $\mathcal{F}_{0,1,2}$, $\mathcal{F}_{3,4,5}$, and $\mathcal{F}_{6,7,8}$ only those component functions that are second-order secure, we just need to examine the cases where two probes are placed on functions of different output shares. For example, one probe on output share $x_0$ and another one on $x_1$, which means examining the identical joint distribution of $(f_0(a_0, b_0), f_1(a_0, b_1), f_2(a_0, b_2), f_3(a_1, b_0), f_4(a_1, b_1), f_5(a_1, b_2))$. Lines 6 to 8 of Algorithm 2 show this check and that of the two other combinations where probes are placed on $(x_0, x_2)$ and $(x_1, x_2)$. In fact, many other probe combinations should be examined, where a probe is placed on an output share, e.g., $x_0$, and the other one on a component function of a different output share, e.g., $f_3(a_1, b_0)$, which propagates to $(a_1, b_0)$. This means examining the identical joint distribution of $(a_1, b_0, f_0(a_0, b_0), f_1(a_0, b_1), f_2(a_0, b_2))$, as shown in line 9 of Algorithm 2. There are 17 other such combinations that should be checked, as given in lines 10 to 26. Note that we do not need to examine the combinations where both probes are placed on different parts of an output share, since they are already covered during the generation of the sets $\mathcal{F}_{0,1,2}$, $\mathcal{F}_{3,4,5}$, and $\mathcal{F}_{6,7,8}$ in Algorithm 1. Further, it is not necessary to examine the cases where probes are placed on different component functions, as only one share of each input variable is involved in every component function, i.e., second-order non-completeness. We also do not need to examine first-order probing security, since a second-order probing-secure design is first-order secure as well. If the found correct and second-order probing-secure solution forms a uniform sharing, examined in line 27 of Algorithm 2, the found solution is a valid one.
Algorithm 1 Search for 2nd-order 3-share rep. of 2-input quadratic function (part one)
Algorithm 2 Search for 2nd-order 3-share rep. of 2-input quadratic function (part two)

Note that the above-given procedure is dedicated to the configuration shown in Equation (4). However, it is not mandatory to place component functions $f_0$, $f_1$, and $f_2$ in the compression layer of the first output share $x_0$. The component functions can be freely assigned to one of the output shares, but assigning more than three component functions to one output share would reduce the chance of having a valid solution, since placing a probe on that output share would propagate to more than three component functions. Splitting the nine component functions into three unordered groups of three gives $9!/((3!)^3 \cdot 3!) = 280$ configurations. We wrote programs in C++ to implement these algorithms and realized that there is no solution satisfying all the above-explained criteria. In fact, Algorithm 1 reports empty sets $\mathcal{F}_{0,1,2}$, $\mathcal{F}_{3,4,5}$, and $\mathcal{F}_{6,7,8}$ for any of these 280 configurations. Getting back to DOM, three fresh mask bits are used to construct the second-order probing-secure 2-input AND with three shares. Hence, we adapted our algorithms to include fresh mask bits. Namely, we need to adjust line 2 in Algorithm 1 to add linear terms (as fresh masks) to each component function. Subsequently, to construct $\mathcal{F}_{0,1,2}$, the fresh masks should be considered in lines 6 to 10 when the second-order security of the construction and the balancedness of the compression layer's output are being checked. In fact, when a probe is placed on a component function, all its inputs are probed, including the fresh mask (if any). Since this process is repeated to generate $\mathcal{F}_{3,4,5}$ and $\mathcal{F}_{6,7,8}$, lines 14 to 23 and lines 24 to 33 should also be adjusted accordingly.
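One way to arrive at the figure of 280 configurations is to enumerate every split of the nine component functions into three unordered groups of three, one group per output share. This reading of the count and the helper name `configurations` are assumptions of the sketch.

```python
from itertools import combinations
from math import factorial


def configurations(items):
    """Yield every split of `items` into unordered groups of three."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for pair in combinations(rest, 2):     # partners of the first element
        remaining = [x for x in rest if x not in pair]
        for tail in configurations(remaining):
            yield [(first, *pair)] + tail


all_configs = list(configurations(range(9)))
closed_form = factorial(9) // (factorial(3) ** 3 * factorial(3))
print(len(all_configs), closed_form)
print(all_configs[0])   # the default configuration of Equation (4)
```

Fixing the group of the smallest remaining index at each step avoids counting the same split more than once.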
With two fresh masks, our programs found 156 672 solutions only for the default configuration shown in Equation (4), each of which is second-order probing-secure with uniform output sharing. One of such solutions is shown below.

AND-XOR
As shown in [SM20], a two-share first-order probing-secure implementation of $f(a, b, c) = ab + c$ can be easily achieved by replacing the fresh mask bit of the masked AND implementation by $c_0$ and $c_1$. However, this is not trivially possible in the second order. Therefore, we adapted our algorithms to include one more shared input $c_{0 \le i \le 2}$ in each component function. Since $c$ does not contribute to any quadratic monomials, every component function can have an additional linear monomial $c_0$, $c_1$, or $c_2$, hence 16 cases for each component function.
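The first-order observation can be checked exhaustively with a short sketch. The placement of c0 and c1 in the second and third component functions is an assumption of this illustration (it mirrors where a DOM-style AND places its mask), and the probe set follows the glitch-extended model at first order only.

```python
from itertools import product
from collections import Counter


def analyse(a, b, c):
    """Probe views and the distribution of x0 for the secrets (a, b, c)."""
    views, x0_count = Counter(), Counter()
    for a0, b0, c0 in product((0, 1), repeat=3):
        a1, b1, c1 = a ^ a0, b ^ b0, c ^ c0
        f0, f1 = a0 & b0, (a0 & b1) ^ c0      # c0 takes the place of the mask
        f2, f3 = (a1 & b0) ^ c1, a1 & b1      # c1 takes the place of the mask
        x0, x1 = f0 ^ f1, f2 ^ f3
        assert (x0 ^ x1) == ((a & b) ^ c)     # correct sharing of ab + c
        for name, seen in (("x0", (f0, f1)), ("x1", (f2, f3)),
                           ("f0", (a0, b0)), ("f1", (a0, b1, c0)),
                           ("f2", (a1, b0, c1)), ("f3", (a1, b1))):
            views[(name, seen)] += 1
        x0_count[x0] += 1
    return views, x0_count


all_secrets = list(product((0, 1), repeat=3))
reference, _ = analyse(0, 0, 0)
secure = all(analyse(*s)[0] == reference for s in all_secrets)
uniform = all(analyse(*s)[1][0] == analyse(*s)[1][1] for s in all_secrets)
print(secure, uniform)
```

Since x1 is determined by x0 and the secret, checking that x0 is balanced for every secret is enough to establish uniformity of the two-share output.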
Staying with the default configuration (Equation (4)), our programs found 73 728 solutions for $f(a, b, c) = ab + c$, without any fresh randomness. One of the solutions is given below.

Quadratic Bijections
Our findings with respect to the possibility of realizing the second-order probing-secure and uniform sharing of the AND-XOR function motivated us to examine the applicability of our algorithms to larger yet quadratic functions. Below, borrowed from [DC07], we restate the definition of affine equivalence, which is helpful to follow the rest of the paper.

Definition 4 (Affine Equivalent
with $a, b, c, d$ the 4-bit input, $x, y, z, t$ the 4-bit output, and $a$ and $x$ the least significant bits. The first coordinate function is the AND-XOR, which we have studied in Section 3.2.2. However, the solutions we found for AND-XOR do not necessarily make a jointly uniform 4-bit output sharing. We found out that 533 solutions of those reported in Section 3.2.2 fulfill the joint uniformity of the output sharing of $Q^4_4$. The one given in Equation (6) is one of those 533 solutions. For the sake of completeness, full details of the shared $Q^4_4$ without fresh randomness are given in Appendix A.
$Q^4_{12}$ : 0123456789CDEFAB has a bit more complicated ANF:
$$x = a, \quad y = bd + cd + b, \quad z = bd + c, \quad t = d.$$
Compared to $Q^4_4$, here we need to examine combinations where two probes are placed on the circuits associated with different coordinate functions. We start with the third output bit $z$, which is similar to the former cases. Since the second coordinate function (generating $y$) has $cd$ in its terms, all possible quadratic monomials $c_i d_j$, $0 \le i, j \le 2$, appear in its component functions. Therefore, when we are searching for a solution for the third coordinate function $f(b, c, d) = bd + c$, we can already add more checks in Algorithm 1 when we construct the sets of component functions $\mathcal{F}_{0,1,2}$, $\mathcal{F}_{3,4,5}$, and $\mathcal{F}_{6,7,8}$. More precisely, in addition to the 3 checks in lines 6 to 8, 9 × 3 other conditions are added to reflect the cases where one probe is placed on a component function of $y$ and one probe on an output of the compression layer of $z$, i.e., for all $0 \le i, j \le 2$, the probed inputs $c_i$ and $d_j$ are considered jointly with the three registers feeding each output share of $z$. This way, we can strongly reduce the solutions for the third coordinate function. In fact, these extra checks result in finding solutions only for one configuration of component functions (see Equation (4)). No solution exists for the other 279 configurations. Our programs found 73 728 solutions for the third coordinate function.
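The coordinate functions quoted for the lookup table 0123456789CDEFAB, namely bd + c and bd + cd + b, can be cross-checked by computing the ANF of every output bit. The sketch assumes the bit order stated earlier in the text (a and x are the least significant bits); the helper name and the string representation of monomials are illustrative choices.

```python
SBOX = [int(ch, 16) for ch in "0123456789CDEFAB"]


def anf_monomials(truth_table, names="abcd"):
    """Return the ANF of a 4-bit Boolean function as a set of monomial strings."""
    coeffs = list(truth_table)
    for i in range(4):                      # binary Moebius transform
        for v in range(16):
            if (v >> i) & 1:
                coeffs[v] ^= coeffs[v ^ (1 << i)]
    return {"".join(n for k, n in enumerate(names) if (v >> k) & 1) or "1"
            for v, c in enumerate(coeffs) if c}


# output bit k of the S-box, for k = 0..3, named x, y, z, t
coords = {name: anf_monomials([(SBOX[v] >> k) & 1 for v in range(16)])
          for k, name in enumerate("xyzt")}

for name, monomials in coords.items():
    print(name, "=", " + ".join(sorted(monomials)))
```

Only y and z are non-linear, which is consistent with the text examining probe combinations across exactly one pair of coordinate functions for this S-box.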
By adapting the algorithms and the programs to the second coordinate function f (b, c, d) = bd + cd + b, we also found 3 072 solutions, again only for one configuration. As the last step, we need to search for a tuple of solutions, one for each coordinate function, which i) jointly have uniform output sharing, and ii) are second-order probing secure. We have to examine all possible 2-probe combinations placed on different coordinate functions. In general, if we have n coordinate functions, we need to examine the cases where
• both probes are placed on the outputs of compression layers, i.e., 3 × 3 × $\binom{n}{2}$ cases, and
• one probe is placed on a component function and the other one on a compression layer's output, i.e., 9 × 3 × 2 × $\binom{n}{2}$ cases.
As stated, due to the second-order non-completeness of component functions, we do not need to consider cases where both probes are placed on the component functions. In total, we need to examine 63 × $\binom{n}{2}$ probe combinations, in this case 63. Among the aforementioned solutions for the second and the third coordinate functions, we found several cases which pass all 63 probe-combination checks and fulfill joint uniformity of the 4-bit output sharing. One of such solutions is given in Appendix B.
For Q 4 293 , whose first coordinate function is x = bc + a, we realized that when we consider only three input variables in each component function (as each coordinate function depends on only three), we cannot find any solution satisfying second-order probing security when all quadratic monomials of three variables (here bc, bd, and cd) are involved. Therefore, we adjusted our programs to consider one more input variable, thereby achieving second-order probing security. More precisely, considering the second coordinate function g(b, c, d) = bd + cd + b, we can use input shares a 0 , a 1 , and a 2 to fulfill the requirements for second-order probing security.
However, it does not mean that the found solution can easily make a jointly uniform sharing with the other coordinate functions, particularly with the first one f (a, b, c) = bc + a, which depends on a as well. Since this leads to a huge number of solutions, we allowed a 0 and a 1 to be added to certain component functions to limit the valid solutions. By this, we found 14 592, 1 024 and 41 920 solutions for the first three coordinate functions, respectively. Our programs finally found several joint solutions for Q 4 293 satisfying all requirements, one of which is given in Appendix C.
Q 4 294 : 0123456789BAEFDC has the following coordinate functions: x = bd + a, y = cd + b, z = c, t = d.
Since the first two coordinate functions are both a form of AND-XOR, we easily obtained their corresponding solutions. The only condition that should be additionally considered is the existence of the monomial bd (in the first coordinate function) when finding solutions for the second coordinate function cd + b. Similar to what was explained for the third coordinate function of Q 4 12 , this extra condition forces the solutions to belong to only one configuration of component functions. As a result, we found 73 728 solutions for each coordinate function. Among them, we found thousands of joint solutions satisfying the second-order probing security (explained for Q 4 12 ) and joint uniformity of the output sharing. One of the solutions is given in Appendix D.
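The probe-combination count introduced for Q 4 12 can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function and parameter names are ours, and it assumes that two probes on compression-layer outputs contribute 3 × 3 share pairs per pair of coordinate functions, which is what makes the total 63 per pair.

```python
from math import comb

def probe_pair_checks(n, shares=3, components=9):
    """Count 2-probe combinations across n coordinate functions.

    Assumption (ours): each coordinate function has `shares` output shares
    and each output share is compressed from `components` / `shares`
    component functions, i.e. 9 component functions per coordinate function.
    """
    pairs = comb(n, 2)                          # unordered pairs of coordinate functions
    both_on_outputs = shares * shares * pairs   # one output share of each function
    mixed = components * shares * 2 * pairs     # component of one, output share of the other
    return both_on_outputs, mixed, both_on_outputs + mixed

# Two coordinate functions (the case of Q 4 12 with y and z): 63 checks in total.
print(probe_pair_checks(2))
# Five coordinate functions, e.g. a 5-bit S-box.
print(probe_pair_checks(5))
```

Both-on-component-function pairs are deliberately not counted, since second-order non-completeness already covers them.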
Q 4 299 : 012345678ACEB9FD, with the following ANF, has a coordinate function with four input variables: x = ad + cd + a, y = ad + bd + cd + b, z = bd + cd + c, t = d.
After adjusting the programs to handle such cases as well, we found 3 072, 144 384, and 3 072 valid solutions for its coordinate functions. Note that the reason behind such a difference is that we considered only three corresponding input variables when looking for solutions for the first and third coordinate functions. If we include the missing input variable in the component functions as well, the number of found solutions would have significantly increased. Nevertheless, with the current solutions, we were able to identify one for each coordinate function which jointly fulfills all conditions. One of the found solutions is shown in Appendix E.
Q 4 300 : 0123458967CDEFAB has all three possible quadratic monomials between b, c, and d in its fourth coordinate function, as given below: x = a, y = bc + b + d, z = bc + cd + c + d, t = bc + bd + cd.
This prevents our algorithms from finding any solution for its three-share second-order probing-secure realization without fresh randomness.
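The coordinate functions of the class identifiers can be recomputed directly from their lookup tables. The following Python sketch applies the binary Möbius transform to each output bit; the helper names (`mobius`, `monomial`, `sbox_anf`) are ours, and the convention that a and x are the least significant bits is taken from the text above.

```python
VARS = "abcd"  # a is the least significant input bit

def mobius(truth_table):
    """Binary Moebius transform: truth table -> ANF coefficients (16 entries)."""
    coeffs = list(truth_table)
    for i in range(4):
        for v in range(16):
            if (v >> i) & 1:
                coeffs[v] ^= coeffs[v ^ (1 << i)]
    return coeffs

def monomial(mask):
    """Name of the monomial whose variables are the set bits of mask."""
    return "".join(VARS[i] for i in range(4) if (mask >> i) & 1) or "1"

def sbox_anf(hex_table):
    """ANF of every output bit (x is the least significant) of a 4-bit S-box."""
    table = [int(ch, 16) for ch in hex_table]
    anf = {}
    for bit, name in enumerate("xyzt"):
        coeffs = mobius([(table[v] >> bit) & 1 for v in range(16)])
        masks = [m for m in range(16) if coeffs[m]]
        # Highest-degree monomials first, then alphabetical.
        masks.sort(key=lambda m: (-bin(m).count("1"), monomial(m)))
        anf[name] = " + ".join(monomial(m) for m in masks) or "0"
    return anf

for name, table in [("Q12", "0123456789CDEFAB"), ("Q294", "0123456789BAEFDC"),
                    ("Q299", "012345678ACEB9FD"), ("Q300", "0123458967CDEFAB")]:
    print(name, sbox_anf(table))
```

For Q 4 300 the fourth coordinate function comes out as the majority function of b, c, and d, which is exactly the case with all three quadratic monomials.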

Composition
As we have shown above, we are able to implement the identifier of all 4-bit quadratic bijective classes except Q 4 300 . However, it can be decomposed into two quadratic bijections from the other classes. More concretely, Q 4 300 can be written as a composition of two bijections belonging to Q 4 4 × Q 4 12 , Q 4 12 × Q 4 4 , Q 4 12 × Q 4 294 , and Q 4 294 × Q 4 12 . As an example, we can write Q 4 300 = A 3 • Q 4 4 • A 2 • Q 4 12 • A 1 with affine functions A 1 , A 2 , and A 3 . Hence, we are able to compose the descriptions given in Appendix A and Appendix B. However, we should emphasize that the composition of such designs does not necessarily lead to a second-order probing-secure implementation. As stated in [Rep15], when probes are placed on composed functions, there is no guarantee that higher-order security is maintained, even if each function is individually higher-order secure. Hence, we have to refresh the signals traversing between the functions.
By integrating both A 1 and A 2 in Q 4 12 , and A 3 in Q 4 4 , we can write Q 4 300 = G • F with F :02468A13DF9BCE57 as x = bc + cd + d, y = a, z = bc + b + d, t = bc + cd + c + d, and G:08192A3B4C5DE6F7, which is Q 4 4 with permuted outputs. When giving the shared output of F as the input to G, the following rules should be carefully followed.
• Every output share x 0 , x 1 , x 2 should be refreshed by two individual fresh mask bits r 0 and r 1 as x 0 + r 0 , x 1 + r 1 , x 2 + r 0 + r 1 . This also holds for an output which comes directly from the input shares if that input participates in a coordinate function. For example, the second output bit of F , i.e., y = a, does not need to be refreshed since a is not involved in any coordinate function of F . However, if Q 4 299 is composed, its fourth output share t = d should also be refreshed since d is involved in its other coordinate functions.
• The refreshing should be performed together with the compression layer, i.e., where the output of the component functions (stored in register) are XORed to make the output shares.
• The output of the compression layer should be stored in a register before being given to the next function. Otherwise, a probe placed on the component function of the next function propagates backward to the compression layer and hence to the output of several component functions, which may make the implementation vulnerable even to first-order attacks. This has been discussed in detail in [SM20].
In short, we need 6 fresh mask bits and 3 register stages to realize a second-order probing-secure implementation of Q 4 300 . A full description of the component functions and how they are connected together is given in Appendix F.
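The decomposition and the refreshing rule can be checked functionally with a short Python sketch. It only verifies that G • F reproduces the lookup table of Q 4 300 and that the refresh leaves the unshared value untouched; it says nothing about probing security. The helper names are ours.

```python
from itertools import product

Q300 = [int(c, 16) for c in "0123458967CDEFAB"]
F = [int(c, 16) for c in "02468A13DF9BCE57"]
G = [int(c, 16) for c in "08192A3B4C5DE6F7"]

def compose(outer, inner):
    """Lookup table of outer(inner(v)) for all 4-bit v."""
    return [outer[inner[v]] for v in range(16)]

def refresh(shares, r0, r1):
    """Refresh a three-share bit with two fresh mask bits r0 and r1."""
    x0, x1, x2 = shares
    return (x0 ^ r0, x1 ^ r1, x2 ^ r0 ^ r1)

def unshare(shares):
    """Recombine a Boolean sharing."""
    return shares[0] ^ shares[1] ^ shares[2]

# G after F gives the identifier of Q 4 300.
print(compose(G, F) == Q300)

# Refreshing never changes the shared value.
print(all(unshare(refresh(s, r0, r1)) == unshare(s)
          for s in product((0, 1), repeat=3)
          for r0, r1 in product((0, 1), repeat=2)))

# Three of the four outputs of F (x, z, t) are refreshed with two bits each;
# y = a is passed on unrefreshed.
print(3 * 2)
```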
We should mention that we also tried, without success, to extend our algorithms to cover cubic monomials. Since the number of component functions increases from 3 to 9 for each output share, the search space explodes and the chance of finding a second-order probing-secure construction becomes low, as each probe placed on a compression layer propagates to the outputs of 9 component functions. Nevertheless, following the comprehensive study conducted in [BNN + 15], the 4-bit cubic S-boxes of all lightweight ciphers can be decomposed into quadratic bijections (in 2 or 3 stages), each of which is affine equivalent to one of the above-explained classes. Hence, we are able to construct their second-order probing-secure implementations, while fresh randomness is required only at their connections. We give more details when dealing with some of such S-boxes in the next section.

Case Studies
In this section, we express some case studies to highlight the benefits and difficulties of the application of our technique on different symmetric cryptographic primitives.

Keccak
We first focus on Keccak [BDPA13], where a 5-bit S-box is used, called χ function. Each of its coordinate functions is a quadratic Boolean function with three input bits. More precisely, all of them are the AND-XOR that we have discussed in Section 3.2.2, where one of the AND operands is complemented. The ANF of the coordinate functions is given below.
where a, b, c, d, e and x, y, z, t, w are the 5-bit input and output, respectively.
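As a reference point, the unshared χ function on one 5-bit row can be written in a few lines of Python. This sketch follows the indexing of the Keccak specification (bit i is XORed with the AND of the complemented bit i + 1 and bit i + 2, indices modulo 5), which may differ from the a, ..., e labelling used here; the function name is ours.

```python
from itertools import product

def chi_row(row):
    """Apply Keccak's chi to one row given as a list of 5 bits."""
    return [row[i] ^ ((row[(i + 1) % 5] ^ 1) & row[(i + 2) % 5])
            for i in range(5)]

# chi is a bijection on 5-bit rows: all 32 outputs are distinct.
outputs = {tuple(chi_row(list(bits))) for bits in product((0, 1), repeat=5)}
print(len(outputs))

# A single set bit also flips the bit two positions "behind" it.
print(chi_row([1, 0, 0, 0, 0]))
```

Every coordinate is an AND-XOR with one complemented operand, which is why the AND-XOR solutions of Section 3.2.2 are the natural building block.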
Looking at the state of the art, a first-order td + 1 masked implementation of Keccak with three shares is first given in [BDPA10], whose output sharing is not uniform; one approach to achieve uniformity is to use fresh randomness. A uniform first-order td + 1 solution with four shares was then introduced in [BDN + 14], which fulfills all requirements without fresh randomness. The trick known as "changing of the guards" was afterwards introduced in [Dae17], that overcomes the non-uniformity of the design in [BDPA10] by re-using the shares of the i-th S-box instance as the fresh mask for the i + 1-th S-box, hence not requiring fresh randomness in each clock cycle.
The first d + 1 masked Keccak with two shares is given in [GSM17a] with DOM as the underlying technique. Although each DOM AND operation needs a fresh mask bit (see Equation (2)), due to the AND-XOR nature of χ's coordinate functions, the additive variable is used to blind the AND, and finally a two-share Keccak without fresh randomness is presented in [GSM17a]. A security flaw in its implementation (with respect to the location of registers), and two-round first-order secure implementations with five (and six) shares are reported in [ABP + 18], which do not make use of any fresh randomness. We also have observed that the implementation given in [GSM17a] does not maintain the uniformity of the χ's output sharing. Although the authors claim that the security loss is negligible [Dae16], for the sake of completeness, we give a solution with uniform output sharing in Appendix G by applying the technique presented in [SM20]. We indeed found 274 924 such solutions.
To the best of our knowledge, the only second-order secure Keccak is given in [GSM17a], where each AND-XOR operation is masked following the second-order DOM multiplier, i.e., 3 fresh mask bits per coordinate function, hence 15-bit fresh randomness per 5-bit S-box of the χ function. As stated in Section 3.2.2, for an AND-XOR, we found 73 728 solutions without fresh randomness. However, considering all 5 coordinate functions, they are not necessarily jointly uniform or second-order probing secure. To reduce the search space, we employed the same technique explained in Section 3.2.3. For instance, the first coordinate function (generating x) receives a, d, e as the input, where de is the only quadratic monomial. The term ae exists in the ANF of the second coordinate function (generating y), and the monomial ad does not show up in the ANF of the other coordinate functions. Hence, when we search for a solution for the first coordinate function, we added extra checks to consider a probe on each possible quadratic monomial 0 ≤ ∀i, j ≤ 2, a i e j . Due to χ's rhythmic ANF pattern, i.e., XORing each bit with an AND result of two other adjacent bits in its row, the same technique can be applied to the other coordinate functions. Namely, we add extra conditions to put a probe on each component function of the coordinate function i + 1 mod 5 when searching for a solution for the i-th coordinate function. In this way, we reduced the number of solutions to 24 for each coordinate function. Note that by such extra conditions, the solutions can be found for only one configuration of the component functions (see Section 3.2.1). Finally, by searching through the found solutions, we identified 659 cases which jointly fulfill all requirements for second-order probing security and uniform output sharing. One of such cases is given in Appendix H. 
Based on our S-box constructions, we have designed a two-share and a three-share round-based implementation of Keccak [1088,512] permutation without any fresh masks. It is one of the SHA3 standards and allows to provide a fair comparison to the state of the art. The underlying design architecture is shown in Figure 1. As shown, we placed a register at the input of the θ transformation. Alternatively, it can be placed at the output of the compression layer. This is essential since θ combines (XOR) every two neighboring output bits of each 5-bit S-box of the χ function. Without such a register, a probe placed on the XORs of θ propagates backwards to two outputs of the compression layer. Then, a second probe can be easily found to show a second-order leakage. Further, a register at the input of the χ function is essential to not violate the requirements for the security under the glitch-extended probing model [ABP + 18]. As a comparison to the state of the art, we refer to Table 1 where our first-order secure design is the only one which i) uses two shares, ii) does not require any fresh masks, iii) and has uniform output sharing. Note that the designs presented in [GSM17a] suffer from non-completeness issue as addressed in [ABP + 18]. Afterwards, the implementations were modified and the results were updated in [GSM17b]. Hence, we compare only with the corrected implementations. Regarding the second order, our construction outperforms the only-previously-published one in terms of randomness complexity and area overhead, as shown in Table 1. Note that the given performance results are excluding PRNGs required to generate the fresh masks, but still our design, which does not use any fresh masks, needs less area footprint. In order to be compatible with the state of the art, we synthesized our designs using UMC 130 standard cell library.
One more important fact to discuss is multivariate leakages between two consecutive rounds. We are not refreshing the χ output, which goes through the diffusion layer (θ, ρ, and π) and is given to the next χ function. A question is whether anything can be gained by placing two probes on two χ operations in consecutive rounds. As a general rule, when two second-order secure functions are composed, fresh masks are required at their conjunction, as illustrated in Section 3.2.4. Our observation is that if there is  a strong diffusion layer between such compositions, no mask refreshing is required. In case of Keccak, independent of the bit permutations ρ and π, each output bit of θ is the XOR result of 11 bits; 9 of them are taken from the output of 9 different 5-bit S-box instances. We expect that every probe placed on the second χ function observes a distribution independent of any other probe placed on the first χ function. Note that it is just our observation confirmed by practical experiments expressed in Section 5. We further should highlight that no verification tool is yet able to evaluate full cipher implementations; hence we cannot provide any proof for this observation.

SKINNY
The 4-bit S-box of SKINNY [BJK + 16]: C6901A2B385D4E7F belongs to the cubic class C 4 223 which can be decomposed as A 3 • Q 4 294 • A 2 • Q 4 294 • A 1 . Among 262 144 ways for the decomposition, we identified a case with the simplest affine functions as A 1 :FEBA7632DC985410: x = a + 1, y = d + 1, z = b + 1, t = c + 1 , A 2 :084C2A6E195D3B7F: x = d, y = c, z = b, t = a , and A 3 :FDECB9A875643120: x = b + 1, y = a + 1, z = c + 1, t = d + 1 , which are just bit permutations and negation of input/output variables. Therefore, the solution given for Q 4 294 in Appendix D can be directly used here. Note that two output bits of Q 4 294 are directly connected to its inputs (see its ANF in page 720), but since they both are involved in the coordinate function of the other output bits, all outputs of the first masked Q 4 294 should be refreshed and stored in registers before being given to the second masked Q 4 294 . Therefore, we made the second-order probing-secure and uniform sharing of the SKINNY 4-bit S-box in 3 register stages using 8 fresh mask bits.
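The decomposition can be checked functionally by composing the five lookup tables given above. The Python sketch below does only that (it does not model the sharing); the helper names are ours.

```python
def table(hex_string):
    """Turn a 16-digit hex string into a 4-bit lookup table."""
    return [int(c, 16) for c in hex_string]

SKINNY = table("C6901A2B385D4E7F")
Q294 = table("0123456789BAEFDC")
A1 = table("FEBA7632DC985410")
A2 = table("084C2A6E195D3B7F")
A3 = table("FDECB9A875643120")

def chain(*tables):
    """Compose lookup tables; the first argument is applied first."""
    result = list(range(16))
    for t in tables:
        result = [t[v] for v in result]
    return result

# S = A3 . Q294 . A2 . Q294 . A1, i.e. A1 is applied first.
print(chain(A1, Q294, A2, Q294, A3) == SKINNY)

# Fresh masks: all 4 outputs of the first Q294 are refreshed with 2 bits each.
print(4 * 2)
```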
In order to construct a round-based secure implementation of SKINNY-64, placing a state register at the output of the S-box is not necessary (see Figure 2). That is because the diffusion layer of SKINNY, including AddRoundTweakey (ART), ShiftRows (SR), and MixColumns (MC), does not mix different output bits of any S-box. More precisely, each output bit of MC is the XOR of at most three bits that belong to three different S-boxes. However, as explained in the case of Keccak, a register stage at the input of the S-box is essential, which is used in our design as the state register. Note that here we do not need to place any register because of the affine functions A 1 , A 2 , and A 3 , since, as stated, they are just bit permutations and negations. The diffusion layer of the SKINNY round function is not as strong as that of Keccak. For example, one row of the cipher state passes through MC unchanged. Therefore, our argument with respect to multivariate leakages across consecutive cipher rounds, given for Keccak, is not valid here. However, ART can be beneficial here. Let us denote the output of SubCells (SC) by (A, B, C, D), where each element corresponds to a row, here 4 nibbles. Ignoring SR, which permutes the nibbles of each row, the output of MC can be written as (A + K + C + D, A + K, B + K′ + C, A + K + C), where (K, K′) denote the 32-bit round tweakey represented in two rows. It can be seen that, except for the second row, each input bit of any S-box in the next round is the XOR result of at least two output bits belonging to different S-boxes. If the round tweakey is presented in a second-order masked form, i.e., with three shares (as in our design in Figure 2), it plays the role of the fresh mask and blinds the second row A + K. Therefore, it is essential to apply key masking, i.e., the key schedule should also be masked with three shares.
In summary, our fully-pipeline round-based implementation has four register stages and requires 8 × 16 fresh mask bits per clock cycle. Table 2 shows the corresponding performance figures. We constructed SKINNY-64-64 encryption function, i.e., with a 64-bit key; the other variants with larger keys can be easily constructed since the SKINNY key schedule is a linear function. Further, due to the lack of higher-order implementation of SKINNY in the open literature, we could not find any other design for comparison.

Midori
Midori's 4-bit S-box S:CAD3EBF789150246 [BBI + 15] is affine equivalent to the identifier of the class C 4 266 . Among several ways to decompose it into quadratic bijections, we selected the case S = A 3 • Q 4 12 • A 2 • Q 4 12 • A 1 , where the output affine is A 3 :FD75A820EC64B931: x = c + d + 1, y = a + 1, z = c + 1, t = b + 1 . Integrating A 1 into the first Q 4 12 would lead to having all quadratic monomials of three input variables in a coordinate function, and we would face a difficulty similar to the one observed for Q 4 300 . In general, we prefer the decompositions with a simple input affine A 1 , and integrate the middle and output affine functions A 2 and A 3 at the output of the quadratic functions. More precisely, we write S = G • F • A 1 with F = A 2 • Q 4 12 :08C43BF7192AE6D5 as x = bd + c + d, y = bd + c, z = bd + cd + b, t = bd + cd + a + b , and G = A 3 • Q 4 12 :FD75A820ECB93164 as x = bd + c + d + 1, y = a + 1, z = bd + c + 1, t = bd + cd + b + 1 .
An important point to mention is that the output of A 1 should be stored in a register before it is connected to the input of the shared F . Otherwise, the non-completeness would be violated. Further, due to the existence of the input affine, we have to consider more checks when we search for solutions for each coordinate function of F . More precisely, a probe can be placed on the XOR of the input affine, and another probe on the component functions or the compression layer of F . In order to cover any affine function placed at the input of F , we consider three cases, where the probe is propagated to all input variables of the same share index, i.e., (a i , b i , c i , d i ) for 0 ≤ i ≤ 2. In other words, similar to what was explained as extra conditions for Q 4 12 in page 719, the combination of each of these propagated probes and a probe placed on every output share should be considered in lines 6 to 8 of Algorithm 1, i.e., 3 × 3 extra conditions for each coordinate function. This way, we make sure that any affine function placed at the input of the target function would not violate the second-order probing security of the implementation. By adjusting our programs and considering these extra checks for F , we found several solutions for each of F and G, one of which is given in Appendix I.
The design architecture of our fully-pipeline round-based second-order Midori-64 supporting both encryption and decryption is depicted in Figure 3, which is similar to the one presented in [MS16a]. As stated, 8-bit fresh masks should be used for the composition of F and G, which is integrated into the compression layer of F whose results should be stored in registers. As a result, the second-order secure Midori S-box needs 4 register layers and 8-bit fresh masks. Note that we do not need any further registers to implement the cipher, as one of the register stages can be seen as the state register, and Midori's MixColumns (MC) only mixes the output bits of different S-boxes, similar to SKINNY. Namely, when a probe is placed on an output bit of MC, it propagates to output bits of different S-boxes that are statistically independent. Regarding multivariate leakages between two consecutive rounds, the argument is similar to the one given for Keccak and SKINNY. Each output bit of MC is the XOR of three different bits of different S-boxes, and together with key masking, this blinds the S-box outputs that are given to the next round function, avoiding second-order leakages. Note that these arguments are valid since each S-box itself (including several stages) is second-order probing secure. The synthesis result of our design is also included in Table 2. Notably, our design seems to be the only second-order implementation of Midori-64 in the literature.

PRESENT
Similar to Midori, the PRESENT S-box [BKL + 07] S:C56B90AD3EF84712 belongs to the cubic class C 4 266 . Therefore, we follow the same principle and express it as S = G • F • A 1 with G:9C3672D805EB41AF, where both F and G are affine equivalent to the quadratic class Q 4 12 . As a matter of chance, here F is the same as that of Midori. Therefore, we only give the sharing of G in Appendix J. Note that all statements with respect to the input affine A 1 and fresh randomness given for the S-box of Midori hold here as well. In summary, we provide a three-share second-order probing-secure realization of the PRESENT S-box with uniform output sharing in 4 register stages making use of 8-bit fresh randomness.
Most of the implementations of PRESENT reported in the literature follow a serialized architecture, where a single S-box is instantiated and shared with the key schedule as well. In the same fashion, we took the design of PRESENT-80 presented in [PMK + 11] and plugged in our S-box as shown in Figure 4. At each clock cycle, the state and key registers are shifted nibble-wise and feed the S-box (16 clock cycles for the state and one clock cycle for the key schedule). After 20 clock cycles, when sBoxLayer is accomplished, the permutation pLayer is performed and the key schedule is finalized in one clock cycle.
Regarding multivariate leakages across two consecutive rounds, we should have a closer look at AddRoundKey and the pLayer. Since pLayer is just a bit permutation, the 4-bit input of any S-box of the next round is a concatenation of 4 output bits of 4 different S-boxes in the previous round. Supposing that each S-box is individually shared, the 4 input bits of any S-box in the next round are independently shared. Moreover, by applying key masking, the shares of the input of every S-box are refreshed again.
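The structural claim about pLayer can be checked mechanically. The Python sketch below uses the bit permutation from the PRESENT specification (bit i moves to position 16·i mod 63, with bit 63 fixed) and lists, for every S-box of the next round, the S-boxes of the previous round that feed it; the function names are ours.

```python
def p_layer(i):
    """Destination position of state bit i under PRESENT's pLayer."""
    return 63 if i == 63 else (16 * i) % 63

def source_sboxes(target_sbox):
    """Previous-round S-boxes whose output bits enter the given S-box."""
    return sorted({i // 4 for i in range(64) if p_layer(i) // 4 == target_sbox})

# Every next-round S-box is fed by 4 distinct previous-round S-boxes.
print(all(len(source_sboxes(j)) == 4 for j in range(16)))
print(source_sboxes(0), source_sboxes(5))
```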
As a comparison to the state of the art, a second-order secure PRESENT S-box is presented in [CBRN14], in which polynomial masking is employed as the underlying masking scheme. Looking at the performance results in Table 2, the randomness complexity and latency of their masked S-box are considerably higher than those of our design.

PRINCE
Both the S-box S:BF32AC916780E5D4 and its inverse S −1 :B732FD89A6405EC1 are used in encryption as well as decryption of PRINCE [BCG + 12]. Based on the study published in [MS16a], the S-box belongs to the cubic class C 4 223 , which cannot be decomposed into two quadratic bijections of those classes that we cover [BNN + 15]. Following the decomposition given in [MS16a] for the S-box inverse, we write S −1 = H • G • F with quadratic bijections F , G, and H:21748BDE65039AFC. Using our programs adjusted to these coordinate functions, we found several solutions for each of F , G, and H satisfying all requirements to be second-order probing-secure with uniform output sharing. One of such solutions is given in detail in Appendix K. Note that the S-box and its inverse are affine equivalent as S = A • S −1 • A with A:B8A93021EDFC6574: x = a + b + d + 1, y = a + 1, z = d, t = c + 1 . As stated in Section 4.3, we considered those extra checks with respect to the input affine when constructing the component functions. Therefore, placing A at the start of S −1 would not violate its second-order security. Since we split the S-box inverse into three quadratic parts, and we should refresh when the functions are composed, our construction has 6 register stages and needs 16 fresh mask bits for each S-box/inverse calculation. It is important to recall that if a pipeline design is made, the 8-bit fresh mask required for G • F and those required for H • G can be connected to the same source, i.e., 8-bit fresh randomness per clock cycle. We give more details about the security of this optimization in Appendix L.
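The affine relation between the S-box and its inverse is easy to confirm from the lookup tables. The following Python sketch rebuilds A from its coordinate functions and checks S = A • S −1 • A; it is a functional check only, and the helper names are ours.

```python
S = [int(c, 16) for c in "BF32AC916780E5D4"]
S_INV = [int(c, 16) for c in "B732FD89A6405EC1"]
A = [int(c, 16) for c in "B8A93021EDFC6574"]

def affine_a(v):
    """A evaluated from its ANF; a and x are the least significant bits."""
    a, b, c, d = v & 1, (v >> 1) & 1, (v >> 2) & 1, (v >> 3) & 1
    x = a ^ b ^ d ^ 1
    y = a ^ 1
    z = d
    t = c ^ 1
    return x | (y << 1) | (z << 2) | (t << 3)

# The lookup table of A matches its coordinate functions.
print([affine_a(v) for v in range(16)] == A)

# S_INV really is the inverse of S, and S = A . S_INV . A.
print(all(S_INV[S[v]] == v for v in range(16)))
print([A[S_INV[A[v]]] for v in range(16)] == S)
```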
In order to construct a secure implementation of the cipher, an extra register layer should be placed at the output of the S-box inverse due to the affine function A at the end of S −1 . Figure 5 depicts the design architecture of our fully-pipeline round-based second-order PRINCE supporting both encryption and decryption. Similar to Midori's MC, each bit of the output of the M -layer is the XOR result of three output bits of different S-boxes with independent sharing. Combined with key masking, the same arguments given with respect to avoiding multivariate leakages across two consecutive rounds hold here as well. In summary, our fully-pipeline design has 7 register layers per cipher round and needs 8 × 16 fresh mask bits per clock cycle.
We are aware of one work dealing with second-order masked hardware implementation of PRINCE, presented in [BKN19]. The authors decomposed the S-box inverse into quadratic bijections, all of which belong to the Q 294 class. They provided 5-share and 3-share second-order masked implementations of Q 294 using 10 and 18 bits of fresh masks (per clock cycle) and made a loop over it with different affine functions to realize either the S-box or its inverse. As one can see in page 725, our construction has lower randomness complexity but higher area overhead due to its fully-pipeline architecture, which leads to higher throughput. The authors also presented two second-order secure designs without S-box decomposition. Looking at Table 2, even with lower throughput, the randomness complexity and area overhead of their designs are far higher than ours. Note that in order to make the shown comparison fairer, we synthesized our designs with the UMC 90 standard cell library. The performance figures in [BKN19] are based on TSMC 90, to which we have no access.
Figure 5: Design architecture of our round-based second-order PRINCE encryption/decryption function.

Analysis
As the first analysis step, we employed SILVER [KSM20] to examine our S-box constructions under the glitch-extended probing model, dedicated to masked hardware designs. SILVER is a formal verification tool, developed to check a given design automatically so that proofs do not have to be written for every design. It receives the gate-level netlist of a hardware design and reports the result of evaluations based on the security notions defined in different articles like [MBR19]. Since SILVER does not simplify anything, its analysis results are reliable (without false positives or false negatives). To this end, we synthesized the HDL code of our S-box designs (also given in the GitHub) and supplied SILVER with the resulting netlist. For all our designs, SILVER reported robust-probing security up to second order as well as the uniformity of the output sharing. Since no verification tool (including SILVER) is yet able to analyze full cipher implementations, similar to the state of the art, we additionally conducted FPGA-based experimental analyses, as detailed in the following.

Setup
We implemented our designs expressed in Section 4 on the target Spartan-6 FPGA of the SAKURA-G board [SAK]. We collected power consumption traces by monitoring the voltage drop over a 1 Ω resistor placed in the Vdd path of the target FPGA using a digital oscilloscope at a sampling rate of 500 MS/s. During the measurements, the target FPGA was supplied by a stable clock source at the frequency of 6 MHz. The target FPGA receives masked input (plaintext) and issues output (ciphertext) also in the same sharing form. The fresh mask bits (if needed) are generated on the fly inside the target FPGA by means of 31-bit LFSRs optimized for Xilinx FPGAs [DMW18]. For each mask bit, we instantiated one LFSR seeded at random right after the power-up of the FPGA.
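To illustrate the idea of one LFSR per mask bit, the following Python sketch models a 31-bit Fibonacci LFSR. The feedback polynomial x^31 + x^28 + 1 is our assumption (a commonly used maximal-length choice); it is not taken from [DMW18], whose FPGA-optimized structure is not reproduced here, and the class name is hypothetical.

```python
class Lfsr31:
    """31-bit Fibonacci LFSR; polynomial x^31 + x^28 + 1 is an assumption."""

    MASK = (1 << 31) - 1

    def __init__(self, seed):
        seed &= self.MASK
        if seed == 0:
            # The all-zero state is a fixed point and must be avoided.
            raise ValueError("seed must be non-zero")
        self.state = seed

    def next_bit(self):
        """Shift once and return the feedback bit as the next mask bit."""
        feedback = ((self.state >> 30) ^ (self.state >> 27)) & 1
        self.state = ((self.state << 1) | feedback) & self.MASK
        return feedback

# One independently seeded LFSR per fresh mask bit, e.g. 8 bits for one S-box.
generators = [Lfsr31(seed) for seed in range(1, 9)]
masks = [g.next_bit() for g in generators]
print(masks)
```

In a real measurement setup the seeds would be drawn at random after power-up; fixed seeds are used here only to keep the sketch reproducible.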
For each design, we collected 100 million traces following the strategy explained in [GJJR11] to conduct reliable fixed-versus-random t-test. We further followed the techniques presented in [SM15] to efficiently perform t-tests at higher orders.
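A minimal Python sketch of the fixed-versus-random t-test at a single sample point is given below. It uses Welch's t-statistic and, for the second order, squares the mean-free samples of each group before testing. The threshold of 4.5 is a commonly used value and an assumption on our side; the incremental formulas of [SM15] needed for 100 million traces are not reproduced, and all names and the simulated data are ours.

```python
import random
from math import sqrt
from statistics import mean, variance

THRESHOLD = 4.5  # assumption: commonly used detection threshold

def welch_t(group_a, group_b):
    """Welch's t-statistic between two groups of samples."""
    var_a = variance(group_a) / len(group_a)
    var_b = variance(group_b) / len(group_b)
    return (mean(group_a) - mean(group_b)) / sqrt(var_a + var_b)

def centered_power(samples, order):
    """Mean-free samples raised to the given power (order 2 in this sketch)."""
    mu = mean(samples)
    return [(s - mu) ** order for s in samples]

def t_statistic(fixed, rand, order=1):
    """First-order test on raw samples, higher order on centered powers."""
    if order == 1:
        return welch_t(fixed, rand)
    return welch_t(centered_power(fixed, order), centered_power(rand, order))

# Simulated sample point: equal means, different variances,
# i.e. no first-order but a clear second-order difference.
rng = random.Random(2021)
fixed = [rng.gauss(0.0, 1.0) for _ in range(2000)]
rand = [rng.gauss(0.0, 2.0) for _ in range(2000)]
print(abs(t_statistic(fixed, rand, 1)) > THRESHOLD)
print(abs(t_statistic(fixed, rand, 2)) > THRESHOLD)
```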

Results
We start with our Keccak design, given in Section 4.1. Due to the large size of the Keccak instance benchmarked in Table 1, which hardly fits into our target FPGA, similar to [ABP + 18] we practically evaluated a smaller variant, i.e., Keccak-f [200]. Note that the χ function, whose protection is what distinguishes the designs in the state of the art, is the same in all Keccak variants; the variants only instantiate the 5-bit S-box a different number of times. Figure 6 shows a sample power trace and the result of up to third-order univariate t-tests, indicating no detected leakage up to second order. Since we also aim at evaluating our design with respect to multivariate leakages (i.e., the combined sample points are taken from different clock cycles), we performed bivariate t-tests as formulated in [SM15]. Here, each power trace contains 5 000 sample points, translating to 5 000 × (5 000 + 1)/2 = 12 502 500 individual t-tests, which take around 30 days to accomplish using all cores of a 24-CPU machine running (at most) at 2.93 GHz. Due to the same difficulty, such analyses are usually done on a small, downsampled part of the traces, e.g., by covering only one S-box calculation and dividing the sampling rate (e.g., by 4 as done in [CRB + 16]). Power consumption traces are inherently low-pass filtered by the Printed Circuit Board (PCB), the shunt resistor, the chip package, and the measurement equipment [MOP07]. Hence, several sample points in each clock cycle of the power traces (close to the power peak, i.e., the clock edge) contain the same information about the leakage at that clock cycle (see [MM13] for relevant information). Therefore, considering the power peak at each clock cycle should be adequate for such a bivariate analysis.
Therefore, instead of decreasing the sampling frequency (which should be synchronous with the device clock at very low sampling rates [OC15]), we extracted one sample (power peak) per clock cycle for the bivariate analyses, but covered all clock cycles involved in the power traces. The results shown at the left side of Figure 6(e) are in line with our expectations, i.e., no detected bivariate second-order leakage. However, in order to verify our bivariate setup, we also implemented a first-order version of the same Keccak variant. To this end, we took the two-share description of the χ function given in Appendix G, and performed the same bivariate analysis. The corresponding results depicted at the right side of Figure 6(e) confirm the correctness and ability of our setup to detect such bivariate leakages.
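The effect of this reduction on the number of bivariate tests can be sketched as follows. The Python code counts the point pairs (including each point paired with itself, as in the count above) and extracts one peak per clock cycle. It assumes, for simplicity, an integer number of samples per clock cycle, which does not hold exactly for 500 MS/s and 6 MHz; the function names and the toy trace are ours.

```python
def bivariate_tests(points):
    """Number of point pairs (i, j) with i <= j."""
    return points * (points + 1) // 2

def peaks_per_cycle(trace, samples_per_cycle):
    """Keep the maximum sample of every complete clock cycle."""
    last_start = len(trace) - samples_per_cycle
    return [max(trace[s:s + samples_per_cycle])
            for s in range(0, last_start + 1, samples_per_cycle)]

# Full traces with 5 000 points versus, e.g., 60 peaks (one per clock cycle).
print(bivariate_tests(5000))
print(bivariate_tests(60))

# Toy trace with 4 samples per clock cycle.
print(peaks_per_cycle([1, 5, 2, 0, 3, 9, 4, 4], 4))
```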
We conducted exactly the same analyses on all our other designs, all of which led to more or less similar results (given in Appendix M) and to the same conclusion, i.e., no first- and second-order univariate and bivariate leakage detected. We should highlight that, except in the case of our PRESENT implementation, which follows a serialized architecture, all our measurements cover the entire calculation of the algorithm (this can also be recognized from the shown sample power traces). For the PRESENT design, we cover only the first half of the encryption, which already spans 300 clock cycles.

Discussions and Conclusions
In this work, we have introduced a methodology to achieve three-share second-order secure implementations of a group of quadratic functions without any fresh randomness. Naturally, by composing such designs we can realize larger constructions. However, refreshing the sharing of the interconnections is inevitable for higher-order security. We showed that having a quadratic round function with a strong diffusion layer, e.g., in Keccak, allows us to realize the second-order secure implementation of the cryptographic primitive without any fresh randomness. Although this is not the case even for lightweight ciphers with a 4-bit S-box, their second-order secure implementations require fresh masks only for composition, i.e., 8 bits per S-box and per clock cycle. To the best of our knowledge, our constructions outperform state-of-the-art implementations with respect to area, throughput, and demand for fresh randomness. More importantly, evaluations based on SILVER [KSM20] confirm the second-order security of our designs under the glitch-extended probing model. Naturally, the most interesting and useful case would be the application of our technique to the AES S-box. This is certainly part of our future work, but it seems challenging, as all operations of the S-box would have to be represented by quadratic functions. This would lead to a high number of compositions and consequently a high number of registers as well as fresh masks. Therefore, intensive research is needed to cope with such difficulties and to outperform the state of the art.
Reducing the number of fresh masks required per cipher round (due to the composition) is an interesting topic as well. We should refer to the relevant study [BDZ20], which addresses some interesting facts with respect to reusing the masks to refresh the shares between consecutive cipher rounds in higher-order td + 1 masked designs. Further, it has been stated in [BKN19] that, when each S-box requires n fresh mask bits and a second-order implementation of PRINCE has 16 S-boxes, 4n fresh mask bits are adequate instead of 16n bits, i.e., by reusing the fresh masks. We should also refer to "changing of the guards" [Dae17], which tries to reuse the shares of irrelevant cipher states as fresh masks in first-order td + 1 masked implementations. Clearly, this topic needs more research and investigation, particularly for higher orders.
In all these research activities, the goal is to avoid or reduce the required fresh masks. As given in the performance figures (Table 1 and Table 2), similar to the state of the art, we exclude the PRNGs necessary to generate the fresh masks. A fundamental question, which is not yet answered and needs proper attention, is how expensive it is to generate a certain number of fresh masks per clock cycle. As the cost function, area, energy, power, and latency are certainly the possible choices.

F(a, b, c, d): 0123456789CDEFAB

$$
\begin{aligned}
\ldots\; c_1, d_1, a_0) &= b_0 d_1 + c_1 d_1 + a_0 + c_1 \rightarrow y_3 \\
g_4(b_1, c_2, d_1, a_1) &= b_1 d_1 + c_2 d_1 + a_1 + b_1 \rightarrow y_4 \\
g_5(b_2, c_0, d_1) &= b_2 d_1 + c_0 d_1 + b_2 + c_0 \rightarrow y_5 \\
g_6(b_0, c_2, d_2) &= b_0 d_2 + c_2 d_2 \rightarrow y_6 \\
g_7(b_1, c_1, d_2, a_1) &= b_1 d_2 + c_1 d_2 + a_1 \rightarrow y_7 \\
g_8(b_2, c_0, d_2) &= b_2 d_2 + c_0 d_2 + b_2 + c_0 \rightarrow y_8 \\
y_3 + y_4 + y_5 &= y_1 \\
y_6 + y_7 + y_8 &= y_2
\end{aligned}
$$

h(a, b, c, d)

G 2-share Masked χ function without Fresh Randomness

K 3-share Masked PRINCE S-box Inverse with 16-bit Fresh Masks

L Fresh Mask-reuse in PRINCE S-box
As explained in Section 4.5, the PRINCE S-box (resp. its inverse) needs to be decomposed into three quadratic bijections as H ∘ G ∘ F ∘ A_1. Therefore, we need to use an 8-bit fresh mask r_1 when we compose G with F ∘ A_1 and another 8-bit fresh mask r_2 when composing with H. However, since r_1 and r_2 are required in different clock cycles (indeed, 2 clock cycles apart) in a fully pipelined design, we can provide r_1 and r_2 using the same source of randomness, which is updated at every clock cycle, i.e., 8 bits of fresh randomness per clock cycle. In order to provide evidence for the security of such an optimization, we constructed a test circuit as shown in Figure 7, which emulates the pipeline architecture. More precisely, two S-boxes are evaluated 2 clock cycles apart. Hence, the 8-bit r_2 used for the second composition of the first S-box is reused by the first stage of the second S-box. After synthesizing the circuit, which receives 24 fresh mask bits and an 8-bit input and provides an 8-bit output (both shared with three shares), we gave the corresponding netlist to SILVER [KSM20], which confirmed its second-order security under the glitch-extended probing model and the uniformity of its output sharing. Note that this optimization is possible since each of the functions F, G, and H is individually second-order glitch-extended probing secure with a uniform output sharing, and the fresh masks are only used to avoid multivariate leakages with respect to probes placed on different functions.
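The mask-reuse schedule of the test circuit can be illustrated with a small bookkeeping sketch. It is not a model of the circuit itself and says nothing about its security; it only tracks which output of a single 8-bit randomness source (one new value per clock cycle) serves as r_1 and r_2 for each S-box evaluation. The names `mask_schedule`, `fresh_bits`, and the parameter `r1_offset` are hypothetical.

```python
MASK_BITS = 8        # width of r_1 and r_2, as stated in the text
STAGE_DISTANCE = 2   # clock cycles between the use of r_1 and r_2

def mask_schedule(start_cycles, r1_offset=0):
    """For each S-box start cycle, list the cycles whose random byte is used.

    r1_offset is a hypothetical parameter: the delay between the start of an
    S-box evaluation and the cycle in which r_1 is consumed.
    """
    return {s: {"r1": s + r1_offset,
                "r2": s + r1_offset + STAGE_DISTANCE}
            for s in start_cycles}

def fresh_bits(schedule):
    """Total fresh mask bits: one 8-bit value per distinct cycle in use."""
    cycles = {c for uses in schedule.values() for c in uses.values()}
    return MASK_BITS * len(cycles)

# Two S-box evaluations started 2 clock cycles apart, as in the test circuit.
schedule = mask_schedule([0, 2])
shared = schedule[0]["r2"] == schedule[2]["r1"]
total = fresh_bits(schedule)
```

With two evaluations started 2 clock cycles apart, the r_2 of the first S-box and the r_1 of the second fall on the same cycle, so three distinct 8-bit values (24 bits) are consumed instead of four (32 bits), which matches the 24 fresh mask bits received by the test circuit.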