Re-Consolidating First-Order Masking Schemes Nullifying Fresh Randomness

Application of masking, known as the most robust and reliable countermeasure to side-channel analysis attacks, on various cryptographic algorithms has attracted a lion's share of research. The difficulty originates from the fact that the overhead of such an algorithmic-level countermeasure might not be affordable. This includes the area and latency overheads and the amount of fresh randomness required to fulfill the resulting design's security properties. There are already techniques applicable to hardware platforms that take glitches into account. Among them, classical threshold implementations force the designers to use at least three shares in the underlying masking. The other schemes, which can deal with two shares, often necessitate the use of fresh randomness. In this work, we present a technique allowing us to use two shares to realize first-order glitch-extended probing secure masked realizations of several functions, including the S-boxes of the Midori, PRESENT, PRINCE, and AES ciphers, without any fresh randomness.


Introduction
The rapid deployment of the Internet of Things (IoT) necessitates physical security in addition to the analytical security of the underlying cryptographic primitives. This is because, in IoT scenarios, the device is in the hands and under the control of legitimate users who can play the role of an adversary. Among physical attacks, Side-Channel Analysis (SCA) attacks [KJJ99, QS01] are considered the most threatening attack vector, as the device often cannot detect whether its physical characteristics, e.g., its power consumption, are being measured. After the introduction of such attacks in the open literature, the relevant scientific communities have dedicated a considerable body of research to understanding their foundations and to developing countermeasures. Due to their sound theoretical basis, masking countermeasures have attracted the most attention from researchers. Based on secret-sharing schemes, the key-dependent intermediate values of the cipher are randomized by applying a masking countermeasure, usually at the algorithmic level. In the most common scheme, Boolean masking [GP99], sensitive variables are split into several shares, whose addition (binary XOR) results in the original (unshared) value.
While it gradually became well understood how to correctly apply masking schemes to software implementations, their application to hardware designs remained an ambiguous process. It has been repeatedly shown that masked hardware implementations thought to be secure [Tri03, OMPR05] exhibit exploitable leakage [MPO05, MME10]. This shortcoming is due to the lack of atomic gates in hardware, leading to a phenomenon called glitches. This issue was theoretically addressed in [NRR06], where an implementation strategy called Threshold Implementation (TI) is introduced, later extended to higher orders in [BGN+14a].
Security of masking schemes is commonly evaluated in the probing model [ISW03], where the order of an attack is reflected by the number of probes simultaneously placed on a device to observe its intermediate signals. Although this model captures the leakage of software implementations, where operations have a sequential nature, the adversary gains more information by probing a single signal in a hardware circuit due to glitches. Therefore, the model has been extended to cover glitches (called the glitch-extended probing model) [FGP+18]. In this model, a probe placed at the output of a gate is propagated backward to its inputs and interpreted as several probes placed on all signals driving the gate. TI circuits [NRS11] are indeed secure under the glitch-extended probing model, as they maintain security in the presence of glitches.
Despite its sound theoretical basis, realizing the TI variant of non-linear functions is not straightforward. Under the TI settings, a function with algebraic degree t should be split into td + 1 shares to achieve security against d-th order attacks. This leads to high area overhead (and/or latency) for functions with a high algebraic degree due to the high number of input shares. As an essential underlying assumption of masking schemes, uniform sharing should also be achieved at the output of a masked (TI) function. It is, however, not trivial to achieve this for every given function. For some functions, a uniform TI with the minimum number of input shares does not even exist, e.g., the 2-input AND gate and the Keccak non-linear function χ [BDN+13]. Nonetheless, this can be solved by the insertion of fresh masks (refreshing the sharing). However, if a uniform TI is found for the non-linear functions of a cipher, no fresh randomness is required. In this case, the entire masked cipher can be implemented with td + 1 shares, while only the primary inputs (plaintext and key) need to be presented in shared form.
By introducing a new technique denoted as changing of the guards [Dae17], it is shown that uniform sharing for a bijective S-box can be achieved by making use of independent shares of the cipher state as fresh masks. This technique has been applied on a 3-share implementation of Keccak non-linear function χ [Dae17], on a 4-share AES S-box decomposed to cubic functions [WM18], and on a 3-share tower-field representation of the AES S-box [Sug19]. All these implementations require fresh randomness just for the first execution of the cipher. Subsequent executions can proceed without any fresh masks.
The td + 1 requirement has been relaxed in [RBN+15, GMK16] by showing how to use only d + 1 shares when d-th order security is desired. Although this allows realizing masked circuits with less area overhead, it mostly forces the use of fresh randomness. Application of such a methodology to AES led to the 2-share masked designs reported in [CRB+16, GMK17], requiring between 18 and 54 fresh mask bits per clock cycle.
As a side note, there have been attempts to reduce the fresh randomness required by such d + 1 hardware masking schemes. For example, a combination of the multiplication algorithm in [BDF+17] and the randomness optimization in [BBP+16] led to the scheme presented in [GM17], which, compared to [GMK16], reduces the number of fresh masks for higher-order hardware implementations of the multiplication. There have also been efforts to reduce the latency of d + 1 hardware masking schemes [GIB18], which even lead to a higher number of required fresh masks.

Our Contributions
In this work, we provide techniques that allow us to realize d + 1 hardware masking schemes without any fresh masks for d = 1, i.e., only first-order secure implementations with two shares. We start our study with a 2-input AND operation and show how to construct its 2-share variant without any fresh randomness while achieving glitch-extended probing security. As a side note, this has never been achieved or reported in the state of the art. We generalize our strategy and provide a glitch-extended-probing-secure representation of larger functions, including the S-boxes of Midori, PRESENT, PRINCE, and AES, without any fresh randomness. We would like to highlight that the aforementioned changing of the guards is not used in our constructions, and that they are the only ones reported so far in the literature with this feature. Hardware platforms are our main implementation target. However, this does not prevent our constructions from being used as a sequence of instructions running on a software platform. Our entire developments, including the source code and the HDL representation of the constructed S-boxes and full ciphers, are given in the github. In addition to our simulations and FPGA-based practical investigations, we have evaluated our constructions with the recently introduced leakage verification tool SILVER [KSM20], which is also available online.

Preliminaries
In this section, in addition to the notations, we give the preliminary knowledge necessary and helpful to follow the rest of the paper. This includes the fundamentals of masking in hardware and various state-of-the-art techniques to realize the masked variant(s) of a given function in hardware.

Notations and Definitions
We denote binary random variables in $\mathbb{F}_2$ with lower-case italic $x$, vectors in $\mathbb{F}_2^{n>1}$ with upper-case italic $X$, the $j$-th element of a vector $X$ with superscripts $x^j$, the $i$-th share of a variable with subscripts $x_i$, coordinate functions with lower-case italic sans-serif $\mathsf{f}(.)$, functions with larger output width with upper-case italic sans-serif $\mathsf{F}(.)$, and sets with calligraphic font $\mathcal{F}$.
In an s-th order Boolean masking, the secret $x$ is represented by $s+1$ shares $(x_0, \ldots, x_s)$ whose XOR equals $x$. The challenge is how to apply a non-linear function $F(.)$ in such a masked form.
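As a minimal illustration of Boolean masking, the following Python sketch splits a value into shares and recombines them. The function names `share` and `unshare`, the default bit width, and the use of the standard `secrets` module as the randomness source are assumptions made for this example only.

```python
import secrets

def share(value: int, n_shares: int, width: int = 8) -> list[int]:
    """Split `value` into `n_shares` Boolean shares whose XOR equals `value`."""
    mask = (1 << width) - 1
    # The first n_shares - 1 shares are drawn uniformly at random.
    shares = [secrets.randbits(width) for _ in range(n_shares - 1)]
    # The last share is chosen so that the XOR of all shares gives `value`.
    last = value & mask
    for s in shares:
        last ^= s
    return shares + [last]

def unshare(shares: list[int]) -> int:
    """Recombine Boolean shares by XORing them."""
    result = 0
    for s in shares:
        result ^= s
    return result
```

With `n_shares = 2` this corresponds to first-order masking (s = 1): each share on its own is uniformly distributed, and only the XOR of both reveals the secret.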

Threshold Implementations
For simplicity, let us consider the case where the numbers of input and output shares are the same. The masked variant of $Y = F(X)$ receives input shares $X_0, \ldots, X_s$ and provides output shares $Y_0, \ldots, Y_s$, each computed by a component function $F_i(.)$ where at least one input share $X_i$ is missing in its input list. This, referred to as non-completeness, guarantees the leakage of $F_i(.)$ to be independent of $X$. Further, for each value of $X$, giving all possible sharings $X_0, \ldots, X_s$ to the masked function leads to a set of $Y_0, \ldots, Y_s$ which should be a uniform sharing of $Y = F(X)$. Not fulfilling uniformity would potentially result in leakage in the subsequent function(s) receiving $Y_0, \ldots, Y_s$ as the shared input. Classical TI defines the minimum number of input shares as td + 1, where t stands for the algebraic degree of the underlying function $F(.)$ and d for the desired order of security. This leads to at least 3 input shares for the smallest non-linear function (t = 2) and first-order security (d = 1).
Achieving a uniform and non-complete TI becomes particularly challenging for functions with a high algebraic degree. Hence, the target function is usually decomposed into smaller (preferably quadratic) functions, and masked variants of each one are derived separately [BNN+15]. This necessitates placing registers between every two consecutive masked functions to avoid the propagation of glitches.
It is easy to achieve non-complete component functions, e.g., by following the direct sharing technique [NRS11]. However, fulfilling uniformity is not trivial, and not even necessarily possible for every function. For example, a uniform TI of any 2-input non-linear function (e.g., an AND gate) with three input shares does not exist [NRS11]. In such cases, by employing fresh randomness (fresh masks), the output shares are re-shared, hence fulfilling uniformity. This concept has been used in all state-of-the-art TIs of the AES S-box, e.g., [BGN+14b, MPL+11]. Note that changing of the guards [Dae17] relaxes the necessity of having fresh masks at every clock cycle by using the input shares of an S-box as the fresh masks for the next neighboring S-box(es), e.g., in [Sug19, WM18] for the AES S-box.
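The non-uniformity of a 3-share AND gate can be observed by exhaustive enumeration. The sketch below assumes one common instance of direct sharing, in which output share i collects only product terms that do not involve input share i; the concrete assignment of product terms to output shares and all function names are illustrative choices, not taken from [NRS11].

```python
from itertools import product
from collections import Counter

BITS = (0, 1)

def direct_sharing_and(a, b):
    """3-share direct sharing of y = a AND b (a, b are 3-tuples of shares)."""
    y0 = (a[1] & b[1]) ^ (a[1] & b[2]) ^ (a[2] & b[1])  # independent of share 0
    y1 = (a[2] & b[2]) ^ (a[0] & b[2]) ^ (a[2] & b[0])  # independent of share 1
    y2 = (a[0] & b[0]) ^ (a[0] & b[1]) ^ (a[1] & b[0])  # independent of share 2
    return (y0, y1, y2)

def sharings(bit):
    """All 3-share Boolean sharings of a single bit."""
    return [(s0, s1, bit ^ s0 ^ s1) for s0, s1 in product(BITS, repeat=2)]

def output_distribution(a, b):
    """How often each output sharing occurs over all 16 input sharings."""
    return Counter(direct_sharing_and(sa, sb)
                   for sa in sharings(a) for sb in sharings(b))

def is_correct():
    """The XOR of the output shares must equal a AND b."""
    return all(y[0] ^ y[1] ^ y[2] == (a & b)
               for a, b in product(BITS, repeat=2)
               for y in output_distribution(a, b))

def is_uniform():
    """Uniform means: each of the 4 valid output sharings occurs 4 times."""
    return all(len(d) == 4 and set(d.values()) == {4}
               for d in (output_distribution(a, b)
                         for a, b in product(BITS, repeat=2)))
```

Under these assumptions, `is_correct()` holds while `is_uniform()` does not, which is consistent with the non-existence result cited above.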

Probing Security
Security of masking schemes is commonly evaluated in the probing security model [ISW03], where the number of probes which the adversary can put on the intermediate signals (variables) of the circuit (design) reflects the order of the attack [BDF+17, DDF14]. In contrast to software implementations, where the operations are performed sequentially and each operation can be modeled as an atomic gate whose output changes once per evaluation, hardware implementations are prone to glitches. In other words, the changes (toggles) at a gate output can propagate to the gates it drives. It means that by putting a probe on a gate output, the adversary not only observes its changes but also obtains a fraction of the changes on the preceding gates that drive the probed gate. This initiated a considerable amount of research on how to adjust the probing security model to account for glitches. It has led to the introduction of the glitch-extended probing model [FGP+18], where a probe placed on a gate output is propagated backward and extended to multiple probes at the inputs of the combinatorial circuit which drives the probed gate. Since we deal with hardware implementations in this article, we consider the glitch-extended probing model in our evaluations and assessments.
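The backward propagation of a probe can be sketched as a traversal of the combinatorial fan-in. In the following Python sketch, the netlist representation (a dictionary mapping each combinatorial wire to the wires driving it), the set `stable` of register outputs and primary inputs, and all signal names are hypothetical and serve only to illustrate the model.

```python
def extended_probes(wire, netlist, stable):
    """Return the stable signals (register outputs or primary inputs)
    observed by a single glitch-extended probe placed on `wire`."""
    if wire in stable:
        # Registers and primary inputs stop the propagation of glitches.
        return {wire}
    probes = set()
    for source in netlist[wire]:
        probes |= extended_probes(source, netlist, stable)
    return probes

# Hypothetical example: four registers feeding a compression layer,
# followed by a gate that combines both compressed outputs.
example_netlist = {
    "x0": ["r0", "r1"],
    "x1": ["r2", "r3"],
    "t": ["x0", "x1"],
}
example_stable = {"r0", "r1", "r2", "r3"}
```

In this example, a probe on `x0` is extended to the two registers `r0` and `r1`, whereas a probe on `t` is extended to all four registers because no register separates `t` from the compression layer.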
We further mainly focus on first-order security. Unless otherwise stated, we refer to a first-order secure/vulnerable design by omitting the term 'first-order.'

Masking with d + 1 Shares
It has been shown in [RBN+15, GMK16] that it is not necessary to follow the td + 1 rule for the number of input shares to construct a secure masked hardware implementation. Instead, d-th order (glitch-extended probing) security can be achieved with d + 1 input shares. This can be done by dividing the given function into two register-isolated parts and introducing fresh randomness. For example, for a 2-input AND gate $x = f(a, b)$ with $a_0, a_1, b_0, b_1$ as input shares and $x_0, x_1$ as output shares, we can write 4 component functions with $r$ being a fresh mask. Note that the results of the component functions are stored in registers $x'_0$ to $x'_3$, and the part that XORs the registers' outputs to make the output shares $x_0$ and $x_1$ is referred to as the compression layer. Without any particular restrictions, the fresh mask $r$ can be added either to $f_0$ or $f_1$ and either to $f_2$ or $f_3$. However, in Domain
It means that we can write the component functions of Equation (2), which provide a uniform and first-order secure sharing of $x = ab + b$. In order to show its security under the glitch-extended probing model, we provide Table 1. It can be seen in Table 1 that, for all possible sharings of each input value $a, b$, $x'_0$ is 1 only once and 0 three times. The same holds for $x'_1$ to $x'_3$. By placing a probe on an output share, e.g., $x_0$, the glitch-extended probing model extends it to two simultaneous probes placed on $x'_0$ and $x'_1$. In column $4P(x'_0, x'_1)$, we show the joint probability of such probes scaled by a factor of 4. It can be seen that, independent of the input value $a, b$, the probes jointly see (0, 0) twice, (1, 0) once, and (0, 1) once. We refer to this as identical joint probability distribution. Since the same is seen when a probe is placed on the other output share $x_1$, the design is first-order glitch-extended probing secure.
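The joint distribution argument can be reproduced by enumerating all sharings. The four component functions below are an assumed instance of a fresh-mask-free sharing of $ab + b$ that matches the properties stated above (each register is 1 exactly once per input value, and the two probed registers see (0, 0) twice, (1, 0) once, and (0, 1) once). They are chosen for illustration and are not claimed to be identical to Equation (2); all names are hypothetical.

```python
from itertools import product
from collections import Counter

BITS = (0, 1)

# Assumed component functions (illustrative instance only).
def f0(a0, b0): return a0 & b0
def f1(a0, b1): return (a0 & b1) ^ b1
def f2(a1, b0): return (a1 & b0) ^ b0
def f3(a1, b1): return a1 & b1

def registers(a, b, a0, b0):
    """Register values for the unshared input (a, b) under the sharing
    fixed by the first shares a0 and b0."""
    a1, b1 = a ^ a0, b ^ b0
    return (f0(a0, b0), f1(a0, b1), f2(a1, b0), f3(a1, b1))

def probe_table(indices):
    """For every unshared input (a, b), count how often each joint value of
    the registers at `indices` occurs over the 4 sharings, i.e., 4*P(...)."""
    return {
        (a, b): Counter(tuple(registers(a, b, a0, b0)[i] for i in indices)
                        for a0, b0 in product(BITS, repeat=2))
        for a, b in product(BITS, repeat=2)
    }
```

A probe on the first output share corresponds to `probe_table((0, 1))` and a probe on the second one to `probe_table((2, 3))`; in both cases the resulting counters are the same for all four input values, which is the identical joint probability distribution required for glitch-extended probing security.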
Looking at the distribution of the output shares $P(x_0, x_1)$, it can be seen that for each input value $a, b$ both possible output sharings of $x$ occur equally likely, which indicates the uniformity of this construction as the sharing of $ab + b$. By representing the 2-input AND as $\bar{a}b + b$, with $\bar{a}$ the inverse of $a$, we can apply the same construction in Equation (2) and present the sharing of $\bar{a}$ by $(\bar{a}_0, a_1)$ or $(a_0, \bar{a}_1)$. This automatically leads to a glitch-extended probing secure and uniform sharing of the 2-input AND gate. Note that the intermediate signals, distributions, and discussions given for $ab + b$ stay the same for $\bar{a}b + b$. We would like to stress that this is the first time that a secure 2-share masked AND gate without any fresh randomness has been constructed.
We have observed that, in contrast to the other cases, the construction given for $\mathcal{Q}^4_{300}$: 01234589DC76BAFE is not secure. We just show the given construction for the 3rd output bit $z = h(a, b, c) = ab + bc + c$ as follows, where 6 component functions are defined.
Under the glitch-extended probing model, a probe placed on $z_0$ is extended to three simultaneous probes on $z'_0$, $z'_1$, and $z'_2$. Simulating the intermediate signals shows that the joint probability distribution $P(z'_0, z'_1, z'_2)$ is not identical for all input values $a, b, c$, indicating its insecurity. We have also confirmed these findings using SILVER. The non-uniformity of the 4-bit shared output of this construction is also reported in [KSM20].

Technique
Our findings with respect to the 2-share masked AND gate without any fresh randomness (in Section 2) motivated us to generalize the technique. As a result, we constructed a generic procedure allowing us to find larger glitch-extended probing secure constructions without any fresh masks. Below we give this procedure, first focusing on small 2-input coordinate functions and later extending it to larger functions.

2-input Quadratic Functions
The trick we used to build the secure 2-share AND gate with no fresh mask in Section 2.4 cannot be generalized to arbitrary functions. Therefore, we construct a general strategy. Let us consider a constant-free arbitrary quadratic function with two inputs $x = f(a, b)$, i.e., $f(0, 0) = 0$. Since its shared variant, in addition to any other linear term, has the quadratic terms $a_0 b_0$, $a_0 b_1$, $a_1 b_0$, and $a_1 b_1$, we have to use four component functions $f_0(a_0, b_0)$, $f_1(a_0, b_1)$, $f_2(a_1, b_0)$, and $f_3(a_1, b_1)$. We follow the steps below; the search algorithm is also given in Algorithm 1.
1. We start by making the set $\mathcal{F}_0$, including all possible 2-input constant-free coordinate functions for $f_0(a_0, b_0)$ which have $a_0 b_0$ in their Algebraic Normal Form (ANF), and similarly for the other component functions. The cardinality of each set is 4, since each of the two linear terms can be either present or absent.
2. Supposing that $f_0(.)$ and $f_1(.)$ are compressed to make an output share $x_0$ (similar to Equation (2)), in the second step we search for tuples in $\mathcal{F}_0 \times \mathcal{F}_1$ i) whose outputs are jointly statistically independent of the inputs $a, b$, and ii) whose XOR (i.e., $x_0$) is a balanced function. The first condition is to achieve security in the glitch-extended probing model, i.e., identical joint probability distribution; see Table 1 with respect to $P(x'_0, x'_1)$. The second condition is necessary to achieve uniformity [KSM20]. Those tuples which fulfill both conditions are added to the set $\mathcal{F}_{0,1}$. The same is repeated for the other two component functions $f_2(.)$ and $f_3(.)$, and the set $\mathcal{F}_{2,3}$ is made.
3. In the last step, we need to find tuples in $\mathcal{F}_{0,1} \times \mathcal{F}_{2,3}$ whose XOR makes a sharing of $x$, i.e., $x_0 + x_1 = x$, which is the correctness property of TI [NRS11]. In order to proceed efficiently in this step, we store the ANF of the XOR result of each tuple of $\mathcal{F}_{2,3}$ (i.e., $x_1 = x'_2 + x'_3$) in a searchable list (e.g., an indexed sorted linked list). By selecting an element in $\mathcal{F}_{0,1}$, we first make the ANF of the XOR of its tuple, which is the ANF of $x_0 = x'_0 + x'_1$. By replacing every variable $a$ with $a_0 + a_1$ (resp. for $b$) in the ANF of the given function $x = f(a, b)$, we know what the ANF of its sharing should be. By XORing these two ANFs, we directly obtain the ANF of the desired $x_1$. Hence, we look into the aforementioned searchable list to check whether there is a tuple in $\mathcal{F}_{2,3}$ with the desired ANF for $x_1$. If so, the found component functions make a correct, non-complete, uniform, and glitch-extended probing secure sharing of $x$.
Algorithm 1: Search for fresh-mask-free sharing of 2-input quadratic function
We should refer to the classical TI design process [NRS11], where non-completeness and correctness are fulfilled by direct sharing, and uniformity is subsequently sought by the addition of correction terms. In the procedure expressed above, we instead first construct component functions that fulfill non-completeness and uniformity, and then search for a combination that fulfills correctness.
Note that if the given function is not constant free, it should be first made so by x = f (a, b) + f (0, 0). After constructing the secure sharing of x, the constant can be added to just one of the component functions leading to a correct and secure sharing of f (.).
By applying this procedure to a 2-input AND gate, we found eight solutions, including the one shown in Section 2.4. We should highlight that, as given above, we considered a configuration where the component functions $f_0(.)$ and $f_1(.)$ are compressed to make the output share $x_0$. This is not mandatory; we can take another configuration where $f_0(.)$ and $f_2(.)$ are compressed (resp. $f_1(.)$ and $f_3(.)$). This leads to another set of 8 solutions for the 2-input AND. We provide all these solutions in the github. Note that having $f_0(.)$ and $f_3(.)$ in the same compression layer does not lead to any solution.
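The three steps can be sketched for the 2-input AND gate in a few lines of Python. The encoding of each candidate as a pair of linear coefficients, the function names, and the brute-force correctness check (used here in place of the ANF lookup in a sorted list described above) are simplifications assumed for this example; it covers only the configuration in which $f_0, f_1$ and $f_2, f_3$ are compressed.

```python
from itertools import product
from collections import Counter

BITS = (0, 1)
# Step 1: a pair (alpha, beta) encodes the candidate u*v + alpha*u + beta*v,
# i.e., the four constant-free functions containing the quadratic term.
CANDIDATES = list(product(BITS, repeat=2))

def evaluate(coeff, u, v):
    alpha, beta = coeff
    return (u & v) ^ (alpha & u) ^ (beta & v)

def pair_is_valid(c_first, c_second, a_index):
    """Step 2: both functions receive share `a_index` of a; the first one
    receives b0 and the second one b1."""
    distributions = []
    for a, b in product(BITS, repeat=2):
        dist = Counter()
        for a0, b0 in product(BITS, repeat=2):
            u = (a0, a0 ^ a)[a_index]
            dist[(evaluate(c_first, u, b0), evaluate(c_second, u, b0 ^ b))] += 1
        distributions.append(dist)
    # i) identical joint distribution for every unshared input (a, b)
    identical = all(d == distributions[0] for d in distributions)
    # ii) the compressed output share is balanced over its 3 input bits
    ones = sum(evaluate(c_first, u, b0) ^ evaluate(c_second, u, b1)
               for u, b0, b1 in product(BITS, repeat=3))
    return identical and ones == 4

def is_correct(c0, c1, c2, c3):
    """Step 3: the XOR of all component functions equals a AND b."""
    return all(
        evaluate(c0, a0, b0) ^ evaluate(c1, a0, b1)
        ^ evaluate(c2, a1, b0) ^ evaluate(c3, a1, b1)
        == ((a0 ^ a1) & (b0 ^ b1))
        for a0, a1, b0, b1 in product(BITS, repeat=4))

def search_and_sharings():
    pairs_01 = [(p, q) for p in CANDIDATES for q in CANDIDATES
                if pair_is_valid(p, q, 0)]
    pairs_23 = [(p, q) for p in CANDIDATES for q in CANDIDATES
                if pair_is_valid(p, q, 1)]
    return [(c0, c1, c2, c3)
            for c0, c1 in pairs_01 for c2, c3 in pairs_23
            if is_correct(c0, c1, c2, c3)]

solutions = search_and_sharings()
```

Under these assumptions, the enumeration yields eight tuples, in line with the number of solutions reported for this configuration. One of them is $f_0 = a_0 b_0 + b_0$, $f_1 = a_0 b_1$, $f_2 = a_1 b_0 + b_0$, $f_3 = a_1 b_1$.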

3-input Cubic Functions
Here, we extend the procedure to an arbitrary 3-bit cubic constant-free coordinate function $x = f(a, b, c)$. Due to its cubic term $abc$, we have to use 8 component functions. The first step is similar to that of the 2-input case, i.e., the sets $\mathcal{F}_0$ to $\mathcal{F}_7$ are made, covering all possible 3-input cubic coordinate functions corresponding to each component function. As a side note, each of these sets has 64 elements.
In the second step, we first suppose that the component functions $f_0$ to $f_3$ are compressed to provide the output share $x_0$. We hence need to search for tuples in $\mathcal{F}_0 \times \mathcal{F}_1 \times \mathcal{F}_2 \times \mathcal{F}_3$ satisfying the conditions expressed in the second step in Section 3.1. The first condition, i.e., identical joint probability distribution, helps to optimize the search process. That is, if the joint probability distribution $P(x'_0, x'_1, x'_2, x'_3)$ is independent of the inputs $a, b, c$, the same holds for the joint probability distribution of every selection of two and of three of $x'_0, x'_1, x'_2, x'_3$. Therefore, we first look for tuples in $\mathcal{F}_0 \times \mathcal{F}_1$ fulfilling the identical joint probability distribution condition. Afterward, the set is extended by expanding the tuples by one more element of $\mathcal{F}_2$ while still satisfying this condition. This is continued until we have tuples in $\mathcal{F}_0 \times \mathcal{F}_1 \times \mathcal{F}_2 \times \mathcal{F}_3$. At this step, the second condition, i.e., balancedness, is examined, and the set $\mathcal{F}_{0,1,2,3}$ is formed from the tuples which satisfy both conditions. The same process is followed to construct the other set $\mathcal{F}_{4,5,6,7}$, including the tuples in $\mathcal{F}_4 \times \mathcal{F}_5 \times \mathcal{F}_6 \times \mathcal{F}_7$.
The last step is identical to the third step in Section 3.1. In short, the sorted list of the ANFs of the XOR results of the elements in $\mathcal{F}_{4,5,6,7}$ and the ANF of the target masked function help us to rapidly find the matching tuples in $\mathcal{F}_{0,1,2,3}$ and $\mathcal{F}_{4,5,6,7}$.
We applied this technique to the 3-input AND function. To give an overview of the complexity of the explained search process, each of the sets $\mathcal{F}_{0,1,2,3}$ and $\mathcal{F}_{4,5,6,7}$ contains 5 120 tuples, and our C++ program using a single CPU needs around 6 seconds to generate 10 368 constructions, each of which is a glitch-extended probing secure and uniform sharing of the 3-input AND without any fresh randomness.
Similar to the 2-input case, there is no necessity to force the component functions $f_0$ to $f_3$ to be compressed together. The component functions can be arbitrarily divided into two parts, but not every division leads to a solution. In our investigations, we found 186 720 such secure constructions for the 3-input AND. Below, we give one of these constructions, while our entire findings are provided in the github.

4-input Cubic Functions
The configuration given in Equation (5), which is not the only possible one, allows us to realize the sharing of any (at most cubic) 4-bit function. In other words, these component functions support any cubic term. For example, if the ANF of $f(.)$ contains the term $acd$, each term $a_{0/1} c_{0/1} d_{0/1}$ of its sharing fits into one of these component functions.
At the first step, we should make the sets F 0 to F 7 for each component function respectively.
• If $f(.)$ is cubic, we fill each $\mathcal{F}_i$, $i \in \{0, \ldots, 7\}$, with all possible constant-free cubic coordinate functions whose cubic terms are exactly those of $f(.)$. This is to guarantee that the shared function fulfills the correctness property of TI.
• If $f(.)$ is quadratic, each $\mathcal{F}_i$ is filled with all possible constant-free quadratic and linear functions irrespective of the terms of $f(.)$. In this case, we further add the constant function $f_i(.) = 0$ to $\mathcal{F}_i$; this helps to cover the cases where we do not need to use all 8 component functions.
• In case of a linear f (.), we obviously do not need to search for any constructions; f (.) can be applied on each set of shares a 0 , b 0 , c 0 , d 0 and a 1 , b 1 , c 1 , d 1 independently.
The second and third steps are identical to those given for 3-input cubic functions in Section 3.2.
Here, an important point is with respect to the way the component functions are defined. As stated, the configuration given in Equation (5) is not the only possible one. It can be seen that shares of different variables are differently assigned to component functions. We indeed found four different ways to do such assignments, shown below for an exemplary input variable w.
Each of the input variables $a, b, c, d$ can be assigned to the component functions in any of these ways. However, it should be taken into account that input variables which jointly appear in a non-linear term of the target function $f(a, b, c, d)$ cannot be assigned to the component functions in the same way. Otherwise, the correctness property of TI cannot be fulfilled. Based on our observations, depending on the target function, several glitch-extended probing secure and uniform solutions for the sharing of an arbitrary (at most cubic) $f(.)$ can be found by changing the way the input shares are assigned to the component functions. We deal with several corresponding case studies in the next section.

Case Studies
This section provides a couple of case studies where we have applied our technique to realize the 2-share secure implementation of different ciphers without any fresh masks.

Midori
As the first case study, we focus on Midori-64 [BBI+15], where a 4-bit S-box $F(a, b, c, d)$: CAD3EBF789150246 is used. Note that the same S-box is used in the design of CRAFT [BLMR19]. Since each of its 4-bit coordinate functions is at most cubic, we can easily apply the technique expressed in Section 3.3 to find uniform and glitch-extended probing secure 2-share constructions for each coordinate function. We have found 112 128, 32 256, 112 128, and 17 346 048 solutions for the four coordinate functions, respectively. Our program running on a machine with 24 CPU cores and 96 GB of RAM required 115 minutes to generate all these solutions.
In the next step, we need to find a combination of these solutions (one for each coordinate function) which is jointly uniform. As a side note, since no fresh mask is used, the output sharings of different coordinate functions are not necessarily jointly uniform. Since the number of possible combinations is very high, we should optimize the search process. If four shared coordinate functions are jointly uniform, any selection of two or three of them is also jointly uniform. Therefore, we can first find two jointly-uniform solutions (for two coordinate functions), then search for the third one to be jointly uniform with the first two, and so on for the fourth coordinate function. We have found millions of such jointly-uniform combinations, one of which is given in Appendix B. Note that the second coordinate function of the Midori S-box $y = g(.)$ is quadratic. Therefore, its sharing has 4 component functions instead of the 8 used for the other shared coordinate functions.
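A joint-uniformity check for two shares can be sketched as follows. With two shares, the second output share is fixed by the first one and the unshared output, so it suffices to test whether the vector of first output shares is uniformly distributed for every unshared input. The helper name `jointly_uniform` and the three toy shared functions on two input bits are assumptions made for illustration; they are not the Midori solutions.

```python
from itertools import product
from collections import Counter

BITS = (0, 1)

def jointly_uniform(first_share_bits, n_inputs):
    """`first_share_bits` is a list of callables (X0, X1) -> bit, each giving
    the first output share of one shared coordinate function."""
    n_outputs = len(first_share_bits)
    for x in product(BITS, repeat=n_inputs):
        counts = Counter()
        for x0 in product(BITS, repeat=n_inputs):
            x1 = tuple(u ^ v for u, v in zip(x, x0))
            counts[tuple(fn(x0, x1) for fn in first_share_bits)] += 1
        # Every value of the first-share vector must occur equally often.
        if len(counts) != 2 ** n_outputs or len(set(counts.values())) != 1:
            return False
    return True

# Toy shared coordinate functions on inputs (a, b); X0 = (a0, b0), X1 = (a1, b1).
def and_share0(x0, x1):
    # First output share of a fresh-mask-free AND sharing: a0*b0 + b0 + a0*b1.
    return (x0[0] & x0[1]) ^ x0[1] ^ (x0[0] & x1[1])

def a_share0(x0, x1):
    return x0[0]  # first share of the linear function y = a

def b_share0(x0, x1):
    return x0[1]  # first share of the linear function y = b
```

In this toy setting, each function is uniform on its own, the pair (`and_share0`, `a_share0`) is jointly uniform, and the pair (`and_share0`, `b_share0`) is not. An incremental search would therefore discard the second pair before attempting to extend it with further coordinate functions.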
The component functions of different coordinate functions which receive the same input shares can be combined in a single combinatorial circuit. For example, a combinatorial circuit which receives $a_0, b_0, c_1, d_0$ can provide four outputs to be individually stored in the registers $x'_2, y'_0, z'_1, t'_2$. This neither violates non-completeness nor affects the glitch-extended probing security of the construction. Therefore, for the sake of area efficiency, this factor can be considered when searching for a jointly-uniform combination of shared coordinate functions. Figure 1 shows a general block diagram of our technique. Using this figure, we would like to stress that the output of the compression layer ($x_{0/1}$, $y_{0/1}$ in this case) cannot be freely given to a linear/non-linear function. If the subsequent function makes use of different output bits in a combinatorial circuit, e.g., $x_0$ and $y_0$, the glitch-extended probing security model extends a probe on such a gate to all of $x'_0, x'_1, x'_2, x'_3$ and $y'_0, y'_1$. Hence, these signals should have an identical joint probability distribution independent of the inputs $a, b, c, d$. This condition has not been considered when searching for combined jointly-uniform shared coordinate functions. We have also examined this on the solutions we found for the Midori S-box; no solution can fulfill such a condition. Therefore, placing a register after the compression layer is necessary if the subsequent function mixes different output bits of the shared function. Note that such a register is not required when a fresh mask is used for each coordinate function, e.g., in [GMK17, CRB+16].
Based on our construction and observations, we have designed a 2-share round-based implementation of the Midori-64 encryption/decryption function without any fresh masks. The design architecture is shown in Figure 2, which is similar to that of [MS16a]. Since the MixColumns of Midori does not mix different output bits of any S-box, we did not need to place a register after the compression layer; instead, it is placed at the input of the S-box, i.e., the output of the MixColumns of the former cipher round. As a comparison to the state of the art, we are only aware of a 3-share classical TI design of Midori-64, which is also free of fresh masks, reported in [MS16a]. In this design, the S-box is decomposed into two quadratic bijections, allowing the shared version of each one to be represented by a uniform TI using three shares. Similar to our design, it has two register stages, hence the same latency with respect to the number of clock cycles. Note that no key masking is used in [MS16a]; hence, in order to provide a fair comparison, we also did not share the key path in our Midori design. We refer to Table 2, where we report the performance figures of our constructions compared to the state of the art. Notably, our construction is slightly larger than the 3-share version [MS16a]. This is due to having more registers at the output of the component functions. Each S-box in our design needs 28 registers (see Appendix B), while the 3-share uniform TI needs 12 registers. As an advantage, the initial masking of the plaintext requires 64 mask bits in our design, whereas it requires 128 bits in the 3-share design.

PRESENT
We applied the same technique to the PRESENT S-box [BKL+07] $F(a, b, c, d)$: C56B90AD3EF84712. Since the process is exactly the same as that for the Midori S-box, we do not re-explain the steps in detail. We found 551 424, 1 152, 5 417 472, and 1 152 uniform and glitch-extended probing secure solutions for its coordinate functions, respectively, in 107 minutes using the same machine described in Section 4.1. We further found millions of jointly-uniform combined solutions (one for each coordinate function). One such solution is given in Appendix C.
In order to compare our construction to the state of the art, we have taken the design of [PMK+11], where the S-box is decomposed into two quadratic bijections, allowing uniformity to be achieved with three shares at each stage without any fresh masks. We could therefore easily replace its S-box with our construction and change the number of shares to 2. As given in Table 2, our design is smaller than that of [PMK+11] due to a reduction in the number of shared state registers.

PRINCE
The application of our technique to PRINCE [BCG+12] is not as straightforward as in the former cases. PRINCE makes use of the S-box $F(a, b, c, d)$: BF32AC916780E5D4 and its inverse B732FD89A6405EC1 in both the encryption and decryption procedures. Based on the fact that the PRINCE S-box and its inverse are affine equivalent, a round-based implementation using only the S-box and some affine functions has been introduced in [MS16a]. The use of our S-box constructions in this strategy would lead to several register stages, as explained in Section 4.1. Therefore, we followed another design architecture, shown in Figure 3, where both the S-box and the S-box inverse are implemented. However, their compression layer is shared, as in every cipher round either the S-box or its inverse is used.
Application of our technique to the S-box led to 4 478 976, 17 346 048, 17 346 048, and 112 128 solutions for its coordinate functions, which took around 4 hours. For the S-box inverse we found 24 576, 10 106 880, 70 957 824, and 9 984 solutions in approximately 11.5 hours. Finding a jointly-uniform combination of these solutions is obviously challenging as the number of possible combinations explodes. We have used one more trick to optimize this search process. Let us focus on a single coordinate function y = f (a, b, c, d). In the solutions found for this coordinate function, the ANFs of the outputs of the compression layer, $y^0 = y_0 + y_1 + y_2 + y_3$ and $y^1 = y_4 + y_5 + y_6 + y_7$, are not unique. In other words, there are several solutions with the same ANF for $y^0$ and $y^1$, in which the component functions generating $y_0$ to $y_7$ are different. The component functions affect the uniformity and glitch-extended probing security, while $y^0$, $y^1$ affect only the uniformity. Therefore, to find a jointly-uniform combination of the shared coordinate functions, we can just consider those solutions that have a unique ANF for the shared output. In other words, we can shrink the set of found solutions by considering only one solution for each ANF of $y^0$, $y^1$, and the same for the other coordinate functions. Application of this strategy led to 2 688, 5 952, 5 952, and 1 536 such solutions for the S-box coordinate functions and 96, 2 880, 7 728, and 1 056 solutions for the coordinate functions of the S-box inverse.
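The pruning step described above can be sketched as a simple deduplication. This is an illustration under assumptions, not the authors' tool: a solution is represented here as a tuple of eight component-function truth tables (a hypothetical data layout), and since a truth table and its ANF are in one-to-one correspondence, the pair of compressed truth tables serves as the key.

```python
def compress(solution):
    """XOR the component-function truth tables into the two output shares.
    `solution` is a tuple of 8 truth tables (tuples of 0/1) for y_0 .. y_7."""
    share0 = tuple(a ^ b ^ c ^ d for a, b, c, d in zip(*solution[:4]))
    share1 = tuple(a ^ b ^ c ^ d for a, b, c, d in zip(*solution[4:]))
    return share0, share1

def one_per_output_anf(solutions):
    """Keep the first solution seen for every distinct pair of output shares."""
    representatives = {}
    for solution in solutions:
        representatives.setdefault(compress(solution), solution)
    return list(representatives.values())

# Toy data with 2-entry truth tables: s1 and s2 compress to the same output pair.
s1 = ((0, 1), (1, 1), (0, 0), (0, 0), (1, 0), (0, 0), (0, 0), (0, 0))
s2 = ((1, 0), (0, 0), (0, 0), (0, 0), (0, 0), (1, 0), (0, 0), (0, 0))
s3 = ((1, 1), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0))
pruned = one_per_output_anf([s1, s2, s3])
```

Only the pruned lists then need to be combined across the four coordinate functions when searching for joint uniformity.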
We have noticed that neither for the S-box nor for the S-box inverse is there a jointly-uniform combination of the found solutions. It is worth mentioning that this is not related to the cubic class to which the PRINCE S-box belongs. Putting some linear bijections at its input and/or output can lead to combined solutions with joint uniformity. As an example, by composing the S-box with A(x, y, z, t) = (x, y + z, z, t), we found several jointly-uniform combined solutions. However, we have to apply the inverse of A after the compression layer, which necessitates a register stage in between (see Section 4.1 and Figure 3). Independent of this, we have found solutions for both the S-box and its inverse in which the first, third, and fourth shared coordinate functions are jointly uniform, but not with the second one. In order to construct a secure implementation of the cipher, a single-bit fresh mask can be applied to the second shared coordinate function. However, we refer to the cipher structure in Figure 3 and highlight the specification of the PRINCE M -layer [BCG + 12]. Every output bit of the M -layer is the XOR of 3 of its input bits, which are the outputs of different S-boxes. In other words, output bits of an S-box (resp. S-box inverse) are never mixed in the M -layer. Therefore, as shown in Figure 3, we did not put a register between the compression and the M -layer. Further, since every shared coordinate function is individually uniform, and, as stated, 3 output bits of different S-boxes (with independent sharing) are XORed to make a single output bit of the M -layer, the sharing of every output nibble of the M -layer (going to the next S-box/S-box inverse) becomes jointly uniform. Hence, there is no need to use a fresh mask for the second shared coordinate function.
We are aware of two works dealing with masked hardware implementations of PRINCE. In [MS16a], the S-box is decomposed into three quadratic bijections, allowing a uniform sharing with three shares without any fresh masks, i.e., three clock cycles per encryption/decryption round. In [BKN19], the authors considered d + 1 masking and did not decompose the S-box, as we do in our construction. Each first-order masked S-box in their design requires 12 fresh mask bits; using a form of mask reuse, the authors could reduce the required fresh masks to 48 bits per clock cycle in a round-based implementation with two clock cycles per cipher round. Our round-based implementation supporting both encryption and decryption also has two register stages per cipher round but does not need any fresh mask bits. Table 2 shows a comparison between the performance of these designs. Since the key path was masked in [BKN19], but not in [MS16a], we provided both designs, with and without key masking, enabling a more meaningful comparison.

AES
For the AES S-box, similar to several state-of-the-art works, e.g., [Can05, CRB + 16, GMK17, GMK16], we followed a tower-field approach for the inversion in GF(2^8). Apart from the input and output isomorphisms, which have been taken from Canright's design [Can05], Figure 4 depicts a block diagram of the inversion in GF(2^4)^2. We present the design using four blocks: the first block is a combination of square-scale and GF(2^4) multiplier, the middle block is the inversion in GF(2^4), and the last two blocks are GF(2^4) multipliers.

Inverter
We start with the GF(2^4) inverter. We have taken F (a, b, c, d) : 0132ED8AF67C495B, which is affine equivalent to the cubic class C^4_282 [BNN + 15]. Application of the technique explained in former sections led to 4 478 976, 5 417 472, 70 957 824, and 140 011 008 glitch-extended probing secure and uniform solutions for its coordinate functions respectively. We also found several jointly-uniform combined solutions leading to a uniform and secure sharing of the GF(2^4) inverter with two shares and no fresh masks. We give one such solution in Appendix F.
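For a bijection shared with two shares and no fresh masks, uniformity of the output sharing amounts to the shared function being a permutation of the whole share space. The following sketch checks this exhaustively. The toy function and its two sharings are hypothetical illustrations; they are not the inverter sharing of Appendix F, and the degenerate sharing is deliberately insecure and serves only as a non-uniform counterexample.

```python
from itertools import product

def is_correct(shared_fn, fn, n_bits):
    """The XOR of the output shares must equal fn of the XOR of the input shares."""
    size = 1 << n_bits
    return all(
        y0 ^ y1 == fn(x0 ^ x1)
        for x0, x1 in product(range(size), repeat=2)
        for y0, y1 in [shared_fn(x0, x1)]
    )

def is_uniform(shared_fn, n_bits):
    """Uniformity for a shared bijection without fresh masks:
    every output sharing (y0, y1) must occur exactly once."""
    size = 1 << n_bits
    images = {shared_fn(x0, x1) for x0, x1 in product(range(size), repeat=2)}
    return len(images) == size * size

def toy(x):
    """Hypothetical 4-bit affine bijection used only for illustration."""
    return x ^ 0x5

def good(x0, x1):
    """Share-wise evaluation: correct and uniform."""
    return (x0 ^ 0x5, x1)

def degenerate(x0, x1):
    """Correct but non-uniform (and insecure): all information in one share."""
    return (toy(x0 ^ x1), 0)
```

For a 4-bit function the check enumerates 256 input sharings, so it is cheap enough to be applied to every candidate combination.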

Multiplier
The two multipliers as the last blocks of the GF(2^4)^2 inverter (see Figure 4) are identical 8-bit to 4-bit quadratic functions. Therefore, we require four component functions for each coordinate function to cover the quadratic terms. Considering a coordinate function f (a, b, c, d, e, f, g, h), an important question is how to assign the input shares to the component functions f_0(.), ..., f_3(.). Taking a single input variable w into account, we can assign its shares w_0, w_1 to the component functions as follows.
Similar to what is expressed in Section 3.3, the shares of input variables which are jointly in a non-linear term should be differently assigned to the component functions. It is important to highlight that, since the underlying module multiplies two 4-bit inputs, its quadratic terms always have one variable from a, b, c, d and the other one from e, f, g, h. Therefore, shares of a, b, c, d can be identically assigned to component functions, and the same holds for shares of e, f, g, h. As an example, for the first coordinate function x = f (a, b, c, d, e, f, g, h) = b + d + ae + ce + af + bf + cf + df + cg + ah + bh, we can consider the following settings.

$$f_0(a_0, b_0, c_0, d_0, e_0, f_0, g_0, h_0), \quad f_1(a_0, b_0, c_0, d_0, e_1, f_1, g_1, h_1),$$
$$f_2(a_1, b_1, c_1, d_1, e_0, f_0, g_0, h_0), \quad f_3(a_1, b_1, c_1, d_1, e_1, f_1, g_1, h_1)$$

Considering all possible ways to assign the shares to the component functions, we have found millions of uniform and glitch-extended probing secure solutions for each component function. We have further easily found several combinations of these solutions which fulfill joint uniformity without any fresh masks. One such solution is given in Appendix G.
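The share assignment for the multiplier can be illustrated with a direct sharing of the first coordinate function. The sketch below is an assumption-laden illustration, not the solution of Appendix G: each bilinear term is simply expanded over the four share-index pairs, and the linear terms b + d are placed (arbitrarily) in the two component functions that see shares of index 0 of e, f, g, h. The check confirms correctness only; such a direct sharing is not claimed to be uniform, which is what the search in the text addresses.

```python
from itertools import product

# x = b + d + ae + ce + af + bf + cf + df + cg + ah + bh
# Left operand bits 0..3 stand for a, b, c, d; right operand bits for e, f, g, h.
LINEAR = [1, 3]
BILINEAR = [(0, 0), (2, 0), (0, 1), (1, 1), (2, 1), (3, 1), (2, 2), (0, 3), (1, 3)]

def bit(value, index):
    return (value >> index) & 1

def unshared(left, right):
    """Reference evaluation of the coordinate function on unshared nibbles."""
    acc = 0
    for i in LINEAR:
        acc ^= bit(left, i)
    for i, j in BILINEAR:
        acc ^= bit(left, i) & bit(right, j)
    return acc

def component(left_share, right_share, with_linear):
    """One component function: it sees one share of each input variable only."""
    acc = 0
    if with_linear:
        for i in LINEAR:
            acc ^= bit(left_share, i)
    for i, j in BILINEAR:
        acc ^= bit(left_share, i) & bit(right_share, j)
    return acc

def shared(l0, l1, r0, r1):
    """f_0(l0, r0), f_1(l0, r1), f_2(l1, r0), f_3(l1, r1)."""
    return (component(l0, r0, True), component(l0, r1, False),
            component(l1, r0, True), component(l1, r1, False))

def is_correct():
    """Exhaustively compare the XOR of f_0..f_3 with the unshared function."""
    for l0, l1, r0, r1 in product(range(16), repeat=4):
        f0, f1, f2, f3 = shared(l0, l1, r0, r1)
        if f0 ^ f1 ^ f2 ^ f3 != unshared(l0 ^ l1, r0 ^ r1):
            return False
    return True
```

Because every component function receives exactly one share of each variable, non-completeness holds by construction in this sketch.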

Square-Scale-Multiplier
Having the uniform and glitch-extended probing secure construction for the GF(2^4) inverter and the GF(2^4) multiplier, the remaining part is the first block (see Figure 4). We intentionally combined the square-scale and the first multiplier into an 8-bit to 4-bit quadratic function. This helps us to achieve uniformity. Otherwise, having a uniform shared multiplier, we do not necessarily obtain a uniform sharing when it is XORed with the square-scale module's output, since the multiplier and the square-scale have common inputs. Similar to what we have done for the multiplier in Section 4.4.2, we have found several probing secure solutions, which are also jointly uniform. However, connecting all these secure modules together based on the block diagram in Figure 4 does not necessarily lead to a secure implementation. The problem is the multipliers at the last stage, which receive the output of the GF(2^4) inverter as well as either the 4-bit LSB of the primary input, P, or its 4-bit MSB, Q. Since the output sharing of the GF(2^4) inverter depends on the sharing of the primary inputs, uniform sharing at the multipliers' input is not guaranteed. Therefore, the sharing of the output of the GF(2^4) inverter should be jointly uniform with the sharing of P for the bottom multiplier, and jointly uniform with the sharing of Q for the top multiplier. Since our uniform shared GF(2^4) inverter is made without any fresh masks (i.e., is a bijection), this condition should be fulfilled by the shared square-scale-multiplier. In other words, the output sharing of the square-scale-multiplier should be jointly uniform with P as well as with Q. We added this condition to the search program when looking for a combination of shared coordinate functions of the square-scale-multiplier, which did not lead to any solution. Instead, we found two alternatives:
• We found several solutions for the shared square-scale-multiplier whose four shared outputs are all jointly uniform. At the same time, their first three shared outputs are jointly uniform with P as well as with Q. This means that we can make use of a single-bit fresh mask to refresh the sharing of the remaining output. This allows us to use only a 1-bit fresh mask to achieve a glitch-extended probing secure GF(2^4)^2 inversion. We give the details of such a shared square-scale-multiplier in Appendix H. This construction is also shown in Figure 5(a), where every stage should be isolated by means of registers.
• We found two distinct solutions for the shared square-scale-multiplier, each with a jointly-uniform output sharing. One of them is jointly uniform with P and the other one is jointly uniform with Q. This implies instantiating two GF(2^4) inversion modules, as shown in Figure 5(b), but allows realizing the shared GF(2^4)^2 inversion fully without any fresh masks. This construction certainly has a higher area overhead compared to the former solution. The details of these solutions are given in Appendix I and Appendix J. We would like to highlight that the non-linear terms in these two constructions are similarly assigned to their component functions. This allows us to combine every component function of one of the constructions with a component function of the other construction. In other words, a component function is made which generates two outputs: one for the square-scale-multiplier that is jointly uniform with P and one for the other square-scale-multiplier that is jointly uniform with Q. This is beneficial to reduce the area overhead of its implementation.
Note that in both solutions given above, the output sharing of the top multiplier is jointly uniform. The same holds for that of the bottom multiplier, but the two output sharings are not jointly uniform with each other. In other words, our constructions are glitch-extended probing secure, but their output cannot be given to a subsequent function which mixes the outputs of the top and bottom multipliers. This includes the output isomorphism (to convert from GF(2^4)^2 to GF(2^8)) as well as the affine transformation of the AES S-box applied after the GF(2^8) inversion.
In order to solve this problem and construct a secure AES encryption module, we make use of the features of the AES MixColumns, which, similar to the PRINCE M -layer, can overcome the joint non-uniformity issue. However, since the multiplication-by-2 and multiplication-by-3 of the MixColumns combine different bits of each S-box output, we divide the MixColumns into two parts. Let us recall the MixColumns operation, where a 4 × 4 matrix is multiplied by a vector of four S-box outputs A, B, C, D. If we denote the output of the GF(2^4)^2 inversion of the corresponding S-boxes by A', B', C', D' and the composition of the output isomorphism and the affine transformation by OA(.), we can write

$$\begin{pmatrix} X \\ Y \\ Z \\ T \end{pmatrix} = \begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix} \cdot \begin{pmatrix} OA(A') \\ OA(B') \\ OA(C') \\ OA(D') \end{pmatrix},$$

where X, Y, Z, T denote the MixColumns output (of a column). We divide this matrix multiplication into two parts as

$$\begin{pmatrix} X \\ Y \\ Z \\ T \end{pmatrix} = \alpha \cdot \beta \cdot \begin{pmatrix} OA(A') \\ OA(B') \\ OA(C') \\ OA(D') \end{pmatrix}.$$

Since all elements of β are 0/1, we can move the application of OA(.) between these two matrix multiplications. More precisely,

$$\begin{pmatrix} X' \\ Y' \\ Z' \\ T' \end{pmatrix} = \beta \cdot \begin{pmatrix} A' \\ B' \\ C' \\ D' \end{pmatrix}, \qquad \begin{pmatrix} X \\ Y \\ Z \\ T \end{pmatrix} = \alpha \cdot \begin{pmatrix} OA(X') \\ OA(Y') \\ OA(Z') \\ OA(T') \end{pmatrix}.$$

Since every row of β contains three 1s, each bit of A', B', C', D' has a uniform sharing, and each S-box input has a uniform and independent sharing, each byte of X', Y', Z', T' becomes jointly uniform. Therefore, application of OA(.) to X', Y', Z', T' would not lead to any leakage. Note that X', Y', Z', T' should be stored in a register before the application of OA(.). In other words, using this technique, the application of MixColumns needs two clock cycles. We followed this strategy and constructed a masked serialized AES encryption module with two shares without any fresh mask (resp. with a 1-bit fresh mask), which is explained in detail below.
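The split of MixColumns can be checked numerically. The sketch below uses one decomposition that is consistent with the description (β binary with three 1s per row); the concrete matrices ALPHA and BETA are an assumption of this example and need not coincide with the authors' choice. The standard AES affine transformation stands in for OA(.); the real OA(.) additionally contains the linear output isomorphism, which keeps it affine, and affinity together with the odd number of 1s per row of β is all the argument relies on.

```python
def xtime(x):
    """Multiply by 2 in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    x <<= 1
    return (x ^ 0x11B) & 0xFF if x & 0x100 else x

def gf_mul(c, x):
    """Multiply x by a small constant c in {0, 1, 2, 3}."""
    return {0: 0, 1: x, 2: xtime(x), 3: xtime(x) ^ x}[c]

def mat_vec(mat, vec):
    out = []
    for row in mat:
        acc = 0
        for c, x in zip(row, vec):
            acc ^= gf_mul(c, x)
        out.append(acc)
    return out

def mat_mat(a, b):
    """Product a * b, valid here because all entries of b are 0 or 1."""
    return [[_dot(a[i], [b[k][j] for k in range(4)]) for j in range(4)] for i in range(4)]

def _dot(row, col):
    acc = 0
    for x, c in zip(row, col):
        acc ^= gf_mul(c, x)
    return acc

def rotl8(x, r):
    return ((x << r) | (x >> (8 - r))) & 0xFF

def affine(x):
    """AES affine transformation, used here as a stand-in for OA(.)."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63

MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
ALPHA = [[3, 2, 0, 0], [0, 3, 2, 0], [0, 0, 3, 2], [2, 0, 0, 3]]   # assumed split
BETA = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]    # assumed split

def mixcolumns_direct(inv_out):
    """MixColumns applied to OA of the four inversion outputs."""
    return mat_vec(MIX, [affine(v) for v in inv_out])

def mixcolumns_split(inv_out):
    """beta first, then OA (after a register in hardware), then alpha."""
    staged = mat_vec(BETA, inv_out)
    return mat_vec(ALPHA, [affine(v) for v in staged])
```

Both evaluation orders give the same column for every input, while only the split version keeps OA(.) away from the non-jointly-uniform inversion outputs.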

AES Encryption
Our serialized AES encryption module, into which both variants of our masked GF(2^4)^2 inversion can be plugged, requires 246 clock cycles to accomplish a full encryption. Figure 6 shows an overview of the datapath of the design. The state registers (resp. key registers) are viewed as a 4 × 4 square array of bytes, and the byte located at row i and column j is denoted as 4 × j + i. In the first 16 clock cycles, the key and plaintext are loaded byte-wise into the module. Meanwhile, the AddRoundKey and the input isomorphism IA(.) are performed, and the result is fed into the GF(2^4)^2 inversion. Note that we have to place a register between IA(.) and the inversion. Otherwise, the first stage of the GF(2^4)^2 inversion violates the non-completeness property. The same has been applied in [CRB + 16, GMK17]. During the next four cycles, IA(.) and the inversion are dedicated to the key schedule procedure. When the last state byte comes out of the inversion module, the ShiftRows is applied, which is not shown in the figure for the sake of simplicity. Afterward, the MixColumns operation is performed in parallel to the AddRoundKey and IA(.) of the next encryption round. As mentioned in Section 4.4.3, OA(.) cannot be applied right after the inversion. It is integrated into the MixColumns, which forces us to put a register after the multiplication by the β matrix to guarantee the non-completeness. The same holds for the key schedule. In order to generate the round keys, the output of the inversion is XORed with OA^{-1}(.) of the corresponding key byte, followed by a register to avoid glitches and any potential leakage. Then, OA(.) is applied to the registered result and the corresponding RCON is XORed, which provides the correct next round key byte. In short, each encryption round takes 23 cycles to finish the inversions and MixColumns entirely.
The state register contains the output of the GF(2^4)^2 inversion (not the S-box), and OA(.) is integrated into the MixColumns. Since the MixColumns is missing in the last cipher round, we have to apply OA(.) after the last AddRoundKey to generate the ciphertext. This is done by instantiating a dedicated OA(.) module separated by an output register, which is enabled only when the encryption is terminated (see the register enabled by the Done signal in Figure 6). Hence, in the last round, the state is XORed with OA^{-1}(.) of the round key and the result is loaded into the output register. Similar to the other serialized AES designs, the ciphertext appears byte-wise at the output of the module. We should point out that, in our design, many registers are placed right after a multiplexer. This allows the synthesizer to make use of scan flip-flops, a technique commonly used in the state of the art.
As a comparison to similar works, we refer to Table 2. Notably, our designs outperform 3-share implementations [MPL + 11, BGN + 14b, BGN + 15] in terms of area overhead and required randomness with the same latency (# of clock cycles). Using changing of the guards, a 4-share and a 3-share masked implementation with no fresh masks have been introduced in [WM18] and [Sug19] respectively. The former has 11 times longer runtime, and the latter has more than double area overhead compared to our designs. The randomness complexity of our designs is the best among all 2-share implementations while their area overhead is slightly larger than [CRB + 16] and is almost the same compared to the constructions presented in [GMK16], and smaller than [UHA17]. There is a mixture of having and not having key masking in the designs mentioned above. Therefore, we provided two versions of our designs, one with the shared key state and another without, indicated by a column in Table 2.
Comparison with changing of the guards. As stated, the technique introduced in [Dae17] makes use of the incomplete shares of an independent state portion to remask the output of a non-uniform shared non-linear function. More precisely, it is used to overcome the non-uniformity issue when the shared non-linear function fulfills the non-completeness. For example, in [Dae17, WM18, Sug19] the underlying masking scheme is td + 1, i.e., 3 (resp. 4) shares for first-order secure realization of quadratic (resp. cubic) functions. Therefore, the non-completeness property is trivially achieved by "direct sharing" [NRS11]. Note that uniform td + 1 sharing of Midori, PRESENT and PRINCE S-boxes are realized without any fresh randomness. Therefore, application of changing of the guards on such implementations is neither necessary nor beneficial.
However, in our technique, which is a d + 1 scheme, we use 2 shares to achieve first-order security. We show how to fulfill the non-completeness property with 2 shares and no fresh randomness, which, to the best of our knowledge, has not been reported before. All former d + 1 constructions reported in the literature require fresh randomness for the sake of non-completeness, not for uniformity. We should also highlight that application of changing of the guards to 2-share implementations to nullify the required fresh randomness does not seem trivial (or even possible). Based on this fact, to the best of our knowledge, no 2-share implementation of any cipher has been reported in the literature where changing of the guards is used. Therefore, in the comparison we provided in Table 2, we could not compare any of our 2-share designs with a state-of-the-art implementation with 2 shares and no fresh randomness.

Analysis
As stated, we have evaluated our constructions using SILVER [KSM20], confirming their first-order security under the glitch-extended probing model. As a side note, we have not considered maskVerif [BBC + 19], which is a language-based verification tool. It is a known issue that maskVerif has false-negative cases, i.e., it may report the insecurity of a secure design. In our experience, these cases arise mainly when the given design does not use fresh masks. An example is given in [KSM20]. Since the goal of our constructions is to avoid any fresh masks, we could not properly examine our designs with maskVerif.
Since verification tools, including SILVER, can only deal with parts of the given designs (i.e., gadgets), analyzing a full encryption/decryption module is still impossible. Therefore, for the sake of completeness we conducted practical analysis by implementing our constructions on an FPGA evaluation board and collecting power consumption traces.

Setup
We made use of a SAKURA-G board [SAK], where a Spartan-6 FPGA is embedded to host cryptographic cores for practical SCA evaluations. We collected the power consumption traces at a sampling rate of 500 MS/s by monitoring the voltage drop over a 1 Ω resistor placed in the Vdd path, amplified by an on-board AC amplifier. During the measurements, our designs running on the aforementioned FPGA were supplied by a stable and jitter-free clock source at the frequency of 6 MHz.

Evaluation Technique
The fixed-versus-random t-test has been used in the state of the art to evaluate the security of masked implementations. As shown in [CEM18], such an analysis at first order may lead to a false-negative result due to the power distribution network (also referred to as the coupling effect), particularly if two shares are used, e.g., [BPG18]. In other works like [SH18], several fresh mask bits are used to overcome this issue. Since our constructions do not make use of any fresh masks, they are also prone to this issue. Therefore, we conducted attacks to evaluate the robustness of our designs. In order to be independent of any particular (hypothetical) leakage model, similar to [DMW18], we performed Moments-Correlating DPA (MC-DPA) attacks [MS16b]. For the round-based implementations (Midori and PRINCE) we performed profiling MC-DPA, where a set of traces is used to extract the model based on the leakage of an S-box, and the attack is performed on another set of traces on the same S-box module. This process examines whether the attacker can find out that the same key portion (a nibble in these cases, with a 4-bit S-box) is used in both profiling and attack traces. For the serialized implementation (AES), we performed collision MC-DPA by constructing the leakage model based on an S-box calculation and conducting the attack on another S-box call. If successful, the attack reveals the linear difference between the corresponding key portions (bytes in the case of AES). We excluded our PRESENT design from these analyses since we just changed the S-box design compared to [PMK + 11], whose uniformity and glitch-extended probing security are confirmed by SILVER.
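The profiling MC-DPA procedure can be illustrated on a toy model. The sketch below is not the authors' evaluation code: it assumes a noise-free, single-sample Hamming-weight leakage of a 2-share PRINCE S-box output, enumerates all masks instead of drawing them at random, and (unlike the real experiment, where both trace sets use the same key) uses different keys for profiling and attack so that the recovered XOR difference is non-trivial. Per-class moments estimated from the profiling set are correlated with the preprocessed attack traces for every guessed difference.

```python
from statistics import fmean

# PRINCE S-box as quoted in the text.
SBOX = [int(c, 16) for c in "BF32AC916780E5D4"]

def hw(x):
    return bin(x).count("1")

def simulate(key, plaintexts):
    """Toy leakage (assumption): HW of both shares, one sample per (plaintext, mask)."""
    pts, traces = [], []
    for p in plaintexts:
        v = SBOX[p ^ key]
        for m in range(16):
            pts.append(p)
            traces.append(hw(v ^ m) + hw(m))
    return pts, traces

def preprocess(traces, order):
    """Order 1: raw samples. Order 2: mean-free samples raised to the power 2."""
    if order == 1:
        return [float(t) for t in traces]
    mu = fmean(traces)
    return [(t - mu) ** order for t in traces]

def profile(pts, traces, order):
    """Estimate the order-th moment for each plaintext nibble class."""
    y = preprocess(traces, order)
    return [fmean(v for p, v in zip(pts, y) if p == c) for c in range(16)]

def pearson(a, b):
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 0.0 if va == 0 or vb == 0 else cov / (va * vb) ** 0.5

def attack(model, pts, traces, order):
    """Correlation for every guessed XOR difference between the two keys."""
    y = preprocess(traces, order)
    return [pearson([model[p ^ d] for p in pts], y) for d in range(16)]

key_profiling, key_attack = 0x3, 0xA
pts_p, tr_p = simulate(key_profiling, range(16))
pts_a, tr_a = simulate(key_attack, range(16))
corr1 = attack(profile(pts_p, tr_p, 1), pts_a, tr_a, 1)
corr2 = attack(profile(pts_p, tr_p, 2), pts_a, tr_a, 2)
recovered = max(range(16), key=corr2.__getitem__)
```

In this toy setting the first-order model is constant across classes, so the first-order correlations carry no information, while the second-order correlation peaks at the true key difference. This mirrors the qualitative behaviour expected from a first-order secure 2-share design.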

Results
For each of the Midori and PRINCE designs and a fixed key, we collected 100 million traces while the plaintext was selected randomly. The first 50 million traces were used to extract first- and second-order models for each S-box input. We used the second 50 million traces to conduct the first- and second-order MC-DPA attacks using the aforementioned models. The corresponding results on an exemplary targeted S-box (nibble) are shown in Figure 7 and Figure 8. The attacks confirm the first-order robustness of the designs, and as expected, the second-order attacks can exploit the leakage. We observed that around 10 million traces are more than enough to recover the correct key candidate through second-order moments. We further conducted the same attacks targeting the last round of the cipher; the corresponding results, also shown in Figure 7 and Figure 8, are along the same lines.
For the AES design, we also collected 100 million traces for random plaintexts. Due to its serialized architecture, we could use the entire 100 million traces to extract first- and second-order models associated with an S-box input, and use the same set of 100 million traces to conduct the attack on another S-box call. The result of this procedure, which reveals the XOR difference between the corresponding key bytes, is shown in Figure 9. Like the former cases, the first-order attacks did not succeed while the second-order leakage was exploited using around 10 million traces. Note that the presented results belong to our design of the masked GF(2^4)^2 inversion without fresh masks (see Figure 5(b) and Table 2). Due to the similarity of the analysis results of our other design with a single-bit fresh mask (Figure 5(a)), we omit the nearly identical figures.
For the sake of completeness and to verify our setup, we repeated these attacks when the initial masking is turned off, i.e., the designs are not changed, but the plaintext and the key are given when the mask for initial sharing is set to 0. The result of identical attacks on Midori and the AES is shown in Appendix K, indicating that 10 000 traces are enough to exploit the first-order leakage.

Discussions and Conclusions
In this work, we have presented a methodology that allows us to realize first-order secure 2-share masked implementations of non-linear functions without any fresh randomness. Considering the common and reasonable glitch-extended probing model, where the effect of glitches in hardware platforms is covered, we showed how to provide first-order secure implementations of various ciphers, including Midori, PRINCE, PRESENT and AES. Compared to the state of the art, to the best of our knowledge, our designs are the only ones which i) use two shares, and ii) require no fresh masks without employing changing of the guards.