3-Share Threshold Implementation of AES S-box without Fresh Randomness

Abstract. Threshold implementation is studied as a countermeasure against side-channel attacks. There had been no threshold implementation for the AES and Keccak S-boxes that satisfies an important property called uniformity. In the conventional implementations, intermediate values are remasked to compensate for the lack of uniformity. The remasking consumes thousands of fresh random bits, and its implementation cost is a serious concern. Daemen recently proposed a 3-share uniform threshold implementation of the Keccak S-box. This is enabled by a new technique called the changing of the guards, which can be applied to any invertible function. Subsequently, Wegener et al. proposed a 4-share threshold implementation of the AES S-box based on the changing of the guards technique. However, a 3-share threshold implementation of the AES S-box remains open. The difficulty lies in the 2-input multiplication used in decomposed S-box representations, which is non-invertible because its input and output sizes differ. In this study, this problem is addressed by introducing a generalization of the changing of the guards technique. The proposed method provides a generic way to construct a uniform sharing for a target function having different input and output sizes. The key idea is to transform a target function into an invertible one by adding additional inputs and outputs. Based on the proposed technique, the first 3-share threshold implementation of the AES S-box without fresh randomness is presented. Performance evaluation and simulation-based leakage assessment of the implementation are also presented.


Introduction
Cryptography can be used in a hostile environment in which an attacker has physical access to a computational device. In such an environment, the attacker can obtain information leakage via physical side-channels such as execution time, power consumption, and electromagnetic radiation. Side-channel attack (SCA), introduced by Kocher et al. [KJJ99], exploits such information leakage to break cryptography. Subsequently, new attacks and countermeasures have been studied for more than two decades.
SCA is a serious threat in the real world, and thus, countermeasures are indispensable in applications such as smartcards. The need for countermeasures against SCA is increasing because embedded devices are increasingly used in hostile environments for the Internet of things [RSWO18].
Correlation between side-channel leakage and the secret data being processed should be removed to counteract SCA. Accordingly, countermeasures based on multi-party computation (MPC) are intensively studied as a promising approach [ISW03]. In these countermeasures, a target variable is split into a set of variables called a share in such a way that a proper subset will not leak information about the original variable. Subsequently, cryptographic computation is performed by using the shares without reconstructing the original value. Threshold implementation (TI) is one such countermeasure [NRR06]. Owing to its efficiency, it is considered a promising countermeasure against SCA. Since the number of shares has a significant impact on the implementation cost, constructing a target function with a minimum number of shares is a critical challenge in TI. It is known that, for an S-box with an algebraic degree of d, d + 1 shares are sufficient [NRS11]. As d ≥ 2 for non-linear functions, realizing 3-share TI has been an important research challenge. To reduce the number of shares, an S-box with a high algebraic degree (e.g., d = 7 for the AES S-box) is decomposed into sub-functions with lower algebraic degrees [MPL+11, BGN+15, GMK17, CRB+16, UHA17, WM18].
Unfortunately, a 3-share realization is not always available because of uniformity, a property regarding the uniform distribution of shares. In particular, there have been no 3-share TIs for the NIST-standardized algorithms Keccak and AES. Therefore, fresh randomness is added during the execution to compensate for the lack of uniformity [MPL+11, BGN+15, UHA17]. This treatment is called remasking. Accordingly, 16 to 64 fresh random bits are required for a single S-box lookup, as summarized in Table 1.
The cost of the fresh random bits is a serious concern. We refer to the paper by Papagiannopoulos [Pap18] for a good survey on the cost of randomness in software platforms. Generating random numbers is a challenging task in hardware platforms, too. The conventional implementations in Table 1 consume dozens of random bits for each cycle. One possible way to generate random numbers at such a high rate is to use low-latency cryptography. De Cnudde et al. used an unrolled PRINCE implementation as a pseudo-random number generator [CRB+16]. An unrolled PRINCE implementation by the algorithm designers uses 8.2 [kGE] and 556 pJ/cycle (8.3 mW @ 67 ns) [BCG+12]. On the other hand, the proposed AES circuit presented in this paper uses 9.9 pJ/cycle. That means the random number generator consumes more than 50 times as much energy per cycle as the main AES circuit. That can be a serious problem in chips with limited energy or power budgets, such as near-field communication (NFC) and battery-powered devices. Another way is to generate all the random bits in advance. However, the implementations in Table 1 consume 2,560-10,240 random bits for each encryption. To store these random bits, 17.9-71.7 [kGE] of registers are needed. That is again more expensive than the main AES circuit. Consequently, randomness optimization is considered an important research challenge [BBP+16, FPS17, Pap18].
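The cost figures quoted above can be re-derived with a few lines of arithmetic. The following is only a sanity-check sketch: the constant of 7 GE per stored bit is an assumption inferred from the quoted 17.9-71.7 [kGE] range, not a value stated in the text, and the function names are hypothetical.

```python
# Sanity check of the randomness-cost figures quoted in the text.
PRINCE_POWER_MW = 8.3       # unrolled PRINCE, as quoted from [BCG+12]
PRINCE_CYCLE_NS = 67.0      # cycle time quoted alongside the power figure
AES_ENERGY_PJ = 9.9         # proposed AES circuit, energy per cycle
GE_PER_REGISTER_BIT = 7.0   # ASSUMPTION: inferred from the quoted kGE range


def prince_energy_pj() -> float:
    """Energy per cycle in pJ (mW times ns gives pJ)."""
    return PRINCE_POWER_MW * PRINCE_CYCLE_NS


def energy_ratio() -> float:
    """How many times more energy the generator uses than the AES core."""
    return prince_energy_pj() / AES_ENERGY_PJ


def storage_kge(random_bits: int) -> float:
    """Register area needed to pre-store the given number of random bits."""
    return random_bits * GE_PER_REGISTER_BIT / 1000.0


if __name__ == "__main__":
    print(f"PRINCE energy per cycle: {prince_energy_pj():.1f} pJ")
    print(f"ratio to AES core:       {energy_ratio():.1f}x")
    for bits in (2560, 10240):
        print(f"{bits} bits -> {storage_kge(bits):.1f} kGE")
```

Under these assumptions the ratio comes out somewhat above 50, which matches the order of magnitude stated in the text.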
Recently, Daemen addressed the problem for Keccak by introducing a new technique called the changing of the guards [Dae17]. The idea behind the changing of the guards technique is to construct a TI of a layer of S-boxes instead of a distinct S-box. The technique enables uniform TI for a layer of any invertible S-boxes. Consequently, a 3-share uniform TI is successfully constructed for the Keccak S-box. However, there is an obstacle to applying the method to the AES S-box. As discussed previously, the AES S-box should be decomposed to reduce the number of shares. Such a decomposed S-box involves 2-input multiplication that is non-invertible, and thus the changing of the guards technique cannot be applied.
More recently, Wegener et al. decomposed the AES S-box into stages having an algebraic degree of 3 in such a way that the changing of the guards technique is applicable. That enabled the first 4-share uniform sharing for the AES S-box [WM18]. However, implementing a 4-share TI is expensive: a straightforward implementation required more than 20 [kGE] for a single S-box. Wegener et al. tackled the problem by proposing a sophisticated circuit architecture that can be realized with only 4.2 [kGE]. However, the area reduction is achieved at the cost of long latency, as shown in Table 1. Therefore, a 3-share uniform TI of the AES S-box is still an important challenge.

Contributions
In this study, we address the problem by introducing a generalization of the changing of the guards technique, which provides a generic way to construct a uniform sharing for any function. The proposed method is based on transforming a target function into an invertible one while keeping its essential functionality unchanged. Subsequently, a TI of the transformed function is constructed. We propose the first 3-share uniform TI of the AES S-box based on the proposed technique. Subsequently, we design and evaluate a concrete 3-share AES implementation with the proposed S-box technique. Owing to the smaller number of shares, the cost of the proposed design is smaller than 1/4 that of the conventional design by Wegener et al. [WM18] in terms of the area-latency product, as shown in Table 1. The security of the proposed method is evaluated through theoretical analysis and a simulation-based leakage assessment.
Organization. The rest of this paper is organized as follows. In Sect. 2, we briefly review the conventional works: TI, the changing of the guards technique, and Canright's S-box implementation [Can05]. Subsequently, the proposed method is described in Sect. 3.1. Its relation to the changing of the guards technique is discussed in Sect. 3.2. The proposed method is applied to Canright's AES S-box implementation in Sects. 4.1 and 4.2. The overall design and analysis of the proposed AES S-box implementation are shown in Sect. 4.3. Sect. 4.4 shows an AES circuit with the proposed S-box implementation and provides the performance and security evaluation of the AES circuit. All the proofs are given in the Appendix.

Notation
In this paper, we discuss shares with three elements unless otherwise stated. Given a variable $x$, its share is represented as $\mathbf{x} = [x_a, x_b, x_c]$. The terms "function" and "mapping" are used interchangeably. Sharing of a mapping $\psi$ is given by a set of component mappings $\{\psi_a, \psi_b, \psi_c\}$. The inputs and outputs of a mapping are denoted by small and large letters, respectively. A mapping from $x$ to $X$ can be expressed as either $X = \psi(x)$ or $x \xrightarrow{\psi} X$. We denote addition over GF(2) by +.

Threshold Implementation
A mapping from $x$ to $X$, namely $X = \psi(x)$, is considered. In TI, a variable $x$ is represented as a share $\mathbf{x} = [x_a, x_b, x_c]$ s.t. $x = x_a + x_b + x_c$. The overall cryptographic algorithm is implemented by using shares without reconstructing the original values. To ensure the requirement, $\psi$ is split into three component functions, namely
$$X_a = \psi_a(x_b, x_c), \quad X_b = \psi_b(x_c, x_a), \quad X_c = \psi_c(x_a, x_b), \qquad (1)$$
where $[X_a, X_b, X_c]$ is an output share. In TI, there are three important properties, namely correctness, non-completeness, and uniformity.

Correctness
The sharing $\{\psi_a, \psi_b, \psi_c\}$ is said to be correct if the output share represents the output of the original function $\psi$, i.e., $X_a + X_b + X_c = \psi(x_a + x_b + x_c)$.

Non-Completeness
The sharing $\{\psi_a, \psi_b, \psi_c\}$ is said to be non-complete if each of $\{\psi_a, \psi_b, \psi_c\}$ uses only a proper subset of the input share $[x_a, x_b, x_c]$. The component functions shown in Eq. (1) are non-complete because the component functions $\psi_a$, $\psi_b$, and $\psi_c$ are independent of $x_a$, $x_b$, and $x_c$, respectively.

Uniformity
The probability distributions of the original and shared inputs are denoted by $P_I(x)$ and $P_I(\mathbf{x})$, respectively. The input share is said to be uniform if and only if, for any $x$, its shares occur with the same probability, i.e., $P_I(\mathbf{x}) = \alpha P_I(x)$, where $\alpha$ is a constant. Let $P_O(X)$ be the distribution of the original output, and let $P_O(\mathbf{X})$ be the distribution of output shares, given uniform input shares. The sharing $\{\psi_a, \psi_b, \psi_c\}$ is said to be uniform if and only if, for any output $X$, its shares occur with the same probability, i.e., $P_O(\mathbf{X})$ is a constant multiple of $P_O(X)$.
Security of (1st-order) TI is proved based on the single probing model [NRS11], in which an attacker is allowed to probe an arbitrary wire in a target. Non-completeness guarantees that leakage from either $\psi_a$, $\psi_b$, or $\psi_c$ is independent of $x$. This is because each component function has only a proper subset of the input share. More specifically, there is no information leakage about the original input if the input share is uniform. Uniformity ensures that shares are distributed uniformly even after component functions are applied.
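The three properties can be checked exhaustively for a small function. The sketch below uses the well-known direct 3-share sharing of a 2-input AND over GF(2) as an illustration; it is not taken from the paper, and the helper names are hypothetical. Uniformity is tested in the per-input sense (every valid output sharing must be hit equally often for each unshared input), which is a sufficient condition for the distribution-based definition above.

```python
from collections import Counter
from itertools import product


def and_direct_sharing(x, y):
    """Direct 3-share sharing of z = x AND y.

    x and y are bit triples (a, b, c). Component a never reads the a
    shares, component b never reads the b shares, and so on.
    """
    xa, xb, xc = x
    ya, yb, yc = y
    za = (xb & yb) ^ (xb & yc) ^ (xc & yb)
    zb = (xc & yc) ^ (xc & ya) ^ (xa & yc)
    zc = (xa & ya) ^ (xa & yb) ^ (xb & ya)
    return (za, zb, zc)


def sharings(value):
    """All 3-share encodings of a single bit."""
    return [(a, b, a ^ b ^ value) for a, b in product((0, 1), repeat=2)]


def is_correct():
    """The XOR of the output shares must equal x AND y for every sharing."""
    for x, y in product((0, 1), repeat=2):
        for xs, ys in product(sharings(x), sharings(y)):
            za, zb, zc = and_direct_sharing(xs, ys)
            if za ^ zb ^ zc != x & y:
                return False
    return True


def output_histogram(x, y):
    """How often each output sharing occurs over all sharings of (x, y)."""
    return Counter(
        and_direct_sharing(xs, ys)
        for xs, ys in product(sharings(x), sharings(y))
    )


def is_uniform():
    """16 input sharings over 4 valid output sharings: 4 hits each if uniform."""
    for x, y in product((0, 1), repeat=2):
        hist = output_histogram(x, y)
        if any(hist[s] != 4 for s in sharings(x & y)):
            return False
    return True


if __name__ == "__main__":
    print("correct:", is_correct())
    print("uniform:", is_uniform())
    print("histogram for x = y = 0:", dict(output_histogram(0, 0)))
```

Non-completeness holds by inspection of the three component expressions. The sharing is correct but not uniform, which is the kind of gap that remasking, or the technique discussed in this paper, has to close.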
For a mapping having an algebraic degree of d, a correct and non-complete sharing can be constructed with d + 1 shares [NRS11]. The number of shares is a significant concern because the implementation cost increases exponentially with d [Dae17]. Therefore, an original function is usually decomposed into sub-functions having lower algebraic degrees to reduce the number of shares [MPL+11, BGN+15, CRB+16, UHA17]. As d ≥ 2 for non-linear functions, TI with three shares has been an important research challenge. There is a recent study on a countermeasure with a smaller number of shares [CRB+16]. However, three is still the minimum number of shares that can satisfy uniformity.

Changing of the Guards
Recently, Daemen introduced a technique called the changing of the guards that enables a 3-share uniform TI for the Keccak S-box [Dae17].The technique can be applied to any invertible S-box.Let S be an invertible S-box.Assume S has a correct and non-complete (but non-uniform) sharing given by A layer comprising L-parallel S-boxes is considered.The layer maps from [x 1 , . . ., x L ] to [X 1 , . . ., X L ] where X i = S(x i ) i.e., x i and X i are input and output of the i-th S-box, respectively.The changing of the guards sharing of the S layer is defined as follows: Definition 1 (Changing of the guards [Dae17]).The changing of the guards sharing of the S-box layer mapping from Fig. 1 shows a diagram of the changing of the guards sharing for L = 3.The idea behind the changing of the guards is to obtain a correct, non-complete, and uniform TI of a layer of S-boxes instead of a distinct S-box.An essential part is to remask the outputs from the component functions by using a neighboring share.More specifically, the non-uniform that represents 0 and is generated by the neighboring input share.
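The remasking-by-neighbor idea can be illustrated on a toy example. The sketch below applies a guard-style sharing to the 3-bit χ map (the Keccak non-linear step restricted to three bits, which is invertible) and checks exhaustively that a single guarded S-box is a bijection on the shares, which implies uniformity. The exact assignment of guard shares to components follows the construction in [Dae17] as recalled here and should be treated as an assumption; all function names are hypothetical.

```python
from itertools import product

N = 3  # S-box width in bits
VECTORS = list(product((0, 1), repeat=N))


def xor(*vectors):
    """Bitwise XOR of equal-length bit tuples."""
    return tuple(sum(bits) % 2 for bits in zip(*vectors))


def chi(x):
    """Unshared chi: X_j = x_j + (x_{j+1} + 1) x_{j+2}."""
    return tuple(x[j] ^ ((x[(j + 1) % N] ^ 1) & x[(j + 2) % N]) for j in range(N))


def chi_part(p, q):
    """One component of the direct sharing of chi; reads only p and q."""
    return tuple(
        p[j]
        ^ ((p[(j + 1) % N] ^ 1) & p[(j + 2) % N])
        ^ (p[(j + 1) % N] & q[(j + 2) % N])
        ^ (q[(j + 1) % N] & p[(j + 2) % N])
        for j in range(N)
    )


def guarded_sbox(a, b, c, guard_b, guard_c):
    """One S-box of the layer. The incoming guard is added so that the
    three masks cancel; the b and c input shares become the next guard."""
    out_a = xor(chi_part(b, c), guard_b, guard_c)  # independent of share a
    out_b = xor(chi_part(c, a), guard_c)           # independent of share b
    out_c = xor(chi_part(a, b), guard_b)           # independent of share c
    return (out_a, out_b, out_c), b, c


def guarded_layer(inputs, guard_b, guard_c):
    """Chain the S-boxes of a layer, passing the guard from one to the next."""
    outputs = []
    for a, b, c in inputs:
        out, guard_b, guard_c = guarded_sbox(a, b, c, guard_b, guard_c)
        outputs.append(out)
    return outputs, guard_b, guard_c


def is_correct_and_bijective():
    """Exhaustive check over all 2^15 share and guard values of one S-box."""
    seen = set()
    for a, b, c, gb, gc in product(VECTORS, repeat=5):
        (oa, ob, oc), ngb, ngc = guarded_sbox(a, b, c, gb, gc)
        if xor(oa, ob, oc) != chi(xor(a, b, c)):
            return False
        seen.add((oa, ob, oc, ngb, ngc))
    return len(seen) == len(VECTORS) ** 5
```

Because each guarded S-box is a bijection on (shares, guard), a layer built by chaining them is a bijection as well, and no fresh randomness is needed beyond the initial guard.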
Although the AES S-box is invertible, a straightforward application of the changing of the guards is very inefficient [Dae17]. This is because the AES S-box has an algebraic degree of 7 and thus, 8 shares are required. Conventionally, the AES S-box is decomposed into sub-functions having lower algebraic degrees. However, the changing of the guards technique cannot be applied to such a decomposed AES S-box. The difficulty lies in the 2-input multiplication used in the decomposed S-boxes, which is non-invertible because its input and output sizes differ. Therefore, efficient application of the changing of the guards technique to the AES S-box remains open [Dae17].
More recently, Wegener et al. proposed the decomposition of the AES S-box into invertible sub-functions having an algebraic degree of 3 [WM18]. By applying the changing of the guards technique to the decomposed S-box, the first 4-share uniform TI for the AES S-box was proposed. However, a 3-share uniform TI for the AES S-box remains open.

Table 2: Irreducible polynomials and normal bases for the tower field representation
Extension | Irreducible polynomial | Normal basis

Canright's AES S-box Implementation
The AES S-box is defined based on inversion over GF(2^8). Efficient implementations exploiting its algebraic property have been studied so far. Notably, Canright proposed a compact implementation based on tower field representation with normal bases [Can05].
Table 2 summarizes the field extensions and bases used in Canright's S-box implementation.
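The inverse of $x \in GF(2^8)$ is considered. There exist unique $\alpha, \beta \in GF(2^4)$ such that $x = \alpha Y + \beta Y^{16}$. $x^{-1}$ is obtained as
$$x^{-1} = (\theta^{-1}\beta)\,Y + (\theta^{-1}\alpha)\,Y^{16}, \qquad (4)$$
where
$$\theta = \alpha\beta + (\alpha + \beta)^2 \mu. \qquad (5)$$
The inverse of $\theta \in GF(2^4)$ is obtained similarly. There exist unique $a, b \in GF(2^2)$ such that $\theta = aZ + bZ^4$. $\theta^{-1}$ is obtained as
$$\theta^{-1} = (\zeta^{-1} b)\,Z + (\zeta^{-1} a)\,Z^4, \qquad (6)$$
where
$$\zeta = ab + (a + b)^2 N. \qquad (7)$$
There exist $s, t \in GF(2)$ such that $\zeta = sW + tW^2$. Here, the inverse is easily obtained as
$$\zeta^{-1} = tW + sW^2. \qquad (8)$$
Fig. 2 shows a circuit diagram for the inversion over GF(2^8) as described above. Note that the operations for $(a + b)^2 N$ and $(\alpha + \beta)^2 \mu$ are called squaring and scaling and are indicated by Sq.Sc. in Fig. 2.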
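The tower-field inversion can be exercised on the two lower levels, GF(2^2) and GF(2^4), with a few lines of code. The sketch below stores an element sW + tW^2 of GF(2^2) as the bit pair (s, t) and an element aZ + bZ^4 of GF(2^4) as a pair of such pairs; the inverse in GF(2^4) is computed from the norm ab + (a + b)^2 N and a coordinate swap in GF(2^2). The choice N = W^2 is an assumption made here for illustration (any N with trace 1 gives an irreducible polynomial), and all function names are hypothetical.

```python
from itertools import product

ONE2 = (1, 1)    # 1 = W + W^2 in the normal basis of GF(2^2)
N_CONST = (0, 1)  # ASSUMPTION: N = W^2, so that Z^2 + Z + N is irreducible


def add2(p, q):
    """Addition in GF(2^2): coordinate-wise XOR."""
    return (p[0] ^ q[0], p[1] ^ q[1])


def mul2(p, q):
    """Multiplication in GF(2^2) using W^2 + W + 1 = 0."""
    s1, t1 = p
    s2, t2 = q
    cross = (s1 & t2) ^ (t1 & s2)
    return (cross ^ (t1 & t2), cross ^ (s1 & s2))


def inv2(p):
    """Inversion in GF(2^2) equals squaring: a swap of the two coordinates."""
    return (p[1], p[0])


def mul4(x, y):
    """Multiplication in GF(2^4) over GF(2^2), normal basis [Z, Z^4]."""
    (a1, b1), (a2, b2) = x, y
    e = mul2(N_CONST, mul2(add2(a1, b1), add2(a2, b2)))
    return (add2(mul2(a1, a2), e), add2(mul2(b1, b2), e))


def inv4(x):
    """Tower-field inversion: norm down to GF(2^2), invert there, scale back."""
    a, b = x
    s = add2(a, b)
    zeta = add2(mul2(a, b), mul2(mul2(s, s), N_CONST))
    zeta_inv = inv2(zeta)
    return (mul2(zeta_inv, b), mul2(zeta_inv, a))


ONE4 = (ONE2, ONE2)  # 1 = Z + Z^4
GF4 = list(product((0, 1), repeat=2))
GF16 = list(product(GF4, repeat=2))


def all_inverses_ok():
    """x * inv4(x) must equal 1 for every non-zero x in GF(2^4)."""
    zero = ((0, 0), (0, 0))
    return all(mul4(x, inv4(x)) == ONE4 for x in GF16 if x != zero)
```

The same pattern lifts to GF(2^8) over GF(2^4) with the constant μ in place of N, which is the structure that the stage partitioning exploits.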
In this study, we consider the 4-stage partitioning as shown in Fig. 2. The partitioning is based on the design by De Cnudde et al. [CRB+16] because its symmetry is appropriate for our design. However, the number of stages is changed from 6 to 4 by merging the linear maps into the neighboring stages to reduce latency. The 1st and 2nd stages are devoted to calculating Eq. (5) and Eq. (7), respectively. The 3rd and 4th stages correspond to Eq. (6) and Eq. (4), respectively. The linear maps for the affine transformation and field isomorphism are placed in the 1st and 4th stages.

Extension and Restriction
Uniformity is closely related to invertibility. On the one hand, a sharing is uniform if it is invertible. On the other hand, the changing of the guards technique can be used if a target function is invertible. The basic idea behind the proposed method is to extend a target function into an invertible one. Invertibility ensures uniform sharing of the extended function.
Obtaining an invertible function from a non-invertible function is the main topic of reversible computing, in which computers composed of invertible (i.e., reversible) primitives are studied for quantum and energy-efficient computing [Tah16]. A common strategy for obtaining a reversible circuit is to add additional inputs and outputs to carry sufficient information required for inversion. An important example is a reversible 2-input AND gate known as the Toffoli gate, shown in Fig. 3-(a). The Toffoli gate has one additional input and two additional outputs for the sake of invertibility.
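The Toffoli gate makes the extension idea concrete: the two inputs are passed through unchanged and the AND result is added to a third input. A minimal sketch with hypothetical function names follows; it checks that the gate is a bijection on three bits and that it computes AND when the additional input is zero.

```python
from itertools import product


def toffoli(a, b, c):
    """Reversible AND: (a, b, c) -> (a, b, c + ab) over GF(2)."""
    return (a, b, c ^ (a & b))


def is_bijection():
    """All 8 inputs must map to 8 distinct outputs."""
    outputs = {toffoli(a, b, c) for a, b, c in product((0, 1), repeat=3)}
    return len(outputs) == 8


def is_involution():
    """Applying the gate twice restores the input."""
    return all(
        toffoli(*toffoli(a, b, c)) == (a, b, c)
        for a, b, c in product((0, 1), repeat=3)
    )


def and_via_toffoli(a, b):
    """With the additional input fixed to 0, the third output is a AND b."""
    return toffoli(a, b, 0)[2]
```

The plain AND gate maps two bits to one and therefore cannot be inverted; the one extra input and two extra outputs are exactly what restores invertibility.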
Definition 3 (Sharing of the extended mapping).
is given by } is a correct, non-complete, and uniform sharing of ψ E .As discussed previously, the input shares should be uniform for security.If the condition is satisfied, it implies that the input shares x and y in Definition 3 are independent.It is interesting to remark that, the discussion on the uniformity of the generalized Feistel Network by Faust et al. [FGP + 18], found independently in the context of high-order masking, closely relates to this Lemma.
So far, $\psi$ is extended to $\psi^E$ by adding an $m$-bit additional input and an $n$-bit additional output. $\psi^E$ provides the expected output only when the additional input satisfies $y = 0$. Similarly, its sharing $\{\psi^E_a, \psi^E_b, \psi^E_c\}$ requires an input share satisfying $y_a + y_b + y_c = 0$ to provide a correct output. Here, we consider a method to obtain such an additional input share. The idea is to convert the unnecessary additional output $[X_a, X_b, X_c]$ to a share representing zero in the same way as the original changing of the guards technique.
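As an illustration of an extended mapping and its sharing, the sketch below takes ψ to be a 2-input AND, extends it to [x, y] -> [x, ψ(x) + y], and shares the result over three shares. The rotation of the x shares and the assignment of the y shares to components are assumptions chosen here so that each component avoids one share index; the paper's own definition may arrange them differently. The check confirms that the shared map is correct and is a bijection on the share space, from which uniformity follows.

```python
from itertools import product

PAIRS = list(product((0, 1), repeat=2))  # 2-bit share vectors (x1, x2)


def and_component(p, q):
    """One component of the direct AND sharing; reads only shares p and q."""
    return (p[0] & p[1]) ^ (p[0] & q[1]) ^ (q[0] & p[1])


def extended_sharing(xa, xb, xc, ya, yb, yc):
    """Sharing of [x, y] -> [x, AND(x) + y].

    ASSUMPTION: the x shares are rotated so that component a reads only
    the b and c shares, component b only c and a, component c only a and b.
    """
    big_x = (xb, xc, xa)
    big_y = (
        and_component(xb, xc) ^ yb,
        and_component(xc, xa) ^ yc,
        and_component(xa, xb) ^ ya,
    )
    return big_x, big_y


def unshare_pair(shares):
    """Recombine three 2-bit share vectors."""
    return tuple(a ^ b ^ c for a, b, c in zip(*shares))


def check_extended_sharing():
    """Exhaustively verify correctness and bijectivity on all 2^9 inputs."""
    seen = set()
    for xa, xb, xc in product(PAIRS, repeat=3):
        x = unshare_pair((xa, xb, xc))
        for ya, yb, yc in product((0, 1), repeat=3):
            big_x, big_y = extended_sharing(xa, xb, xc, ya, yb, yc)
            if unshare_pair(big_x) != x:
                return False
            if big_y[0] ^ big_y[1] ^ big_y[2] != (x[0] & x[1]) ^ ya ^ yb ^ yc:
                return False
            seen.add((big_x, big_y))
    return len(seen) == 2 ** 9
```

Note that the plain AND sharing on its own is not uniform; it is the additional input y, acting as a mask, that makes the extended sharing invertible.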
We first consider the following map ψ ⊥ : A diagram for ψ ⊥ is shown in Fig. 4-(left) in which ⊥ represents a zero function such that ⊥ (x) = 0 for all x.Subsequently, its sharing is considered.
is given by In the sharing form, the zero function ⊥ is realized by [X a , X b , X c ] = [x a + x b , x b , x a ] in the same way as the original changing of the guards.The probability of observing an output share [X a , X b , X c ] is constant because Therefore, the sharing is uniform.
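The uniformity of the shared zero function can be confirmed by enumeration. The sketch below uses single-bit shares and the output assignment exactly as printed above, [x_a + x_b, x_b, x_a]; the function names are hypothetical. It checks that every output share encodes zero and that each of the four possible zero encodings occurs equally often.

```python
from collections import Counter
from itertools import product


def zero_sharing(xa, xb, xc):
    """Shared zero function as printed in the text: the output always sums to 0."""
    return (xa ^ xb, xb, xa)


def zero_encodings():
    """All 3-share encodings of the bit 0."""
    return {(a, b, a ^ b) for a, b in product((0, 1), repeat=2)}


def output_histogram():
    """Output share counts over all 8 input share values."""
    return Counter(zero_sharing(*shares) for shares in product((0, 1), repeat=3))


def is_uniform_zero_share():
    hist = output_histogram()
    sums_to_zero = all(a ^ b ^ c == 0 for a, b, c in hist)
    covers_all = set(hist) == zero_encodings()
    equal_counts = len(set(hist.values())) == 1
    return sums_to_zero and covers_all and equal_counts
```

Each of the four zero encodings is produced by exactly two of the eight input share values, so the output is a uniformly distributed share of zero whenever the input share is uniform.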
Finally, we define a restricted mapping composed of ψ E and ψ ⊥ .
Definition 6 (Restricted mapping).A restricted mapping ψ R is defined as , or equivalently Consequently, a uniform sharing with an additional output share [X a , X b , X c ] satisfying X a + X b + X c = 0 is obtained.The additional output is used as an additional input to the next sharing.Thus, ψ R can be calculated without using fresh randomness.The above discussion provides a generic way of constructing a uniform sharing for any function.
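A small enumeration can illustrate how extension and restriction combine. In the sketch below, ψ is again a 2-input AND; the shared output Y is masked by an additional input share encoding zero, and the additional output is a zero share derived from the x shares with the per-bit assignment [x_a + x_b, x_b, x_a] quoted earlier. The assignment of mask shares to components is an assumption made for illustration, and the names are hypothetical. The check counts how often each output share occurs and compares it with the number of unshared inputs that produce the corresponding unshared output.

```python
from collections import Counter
from itertools import product

BITS = (0, 1)
PAIRS = list(product(BITS, repeat=2))  # 2-bit share vectors (x1, x2)


def and_component(p, q):
    """One component of the direct AND sharing; reads only shares p and q."""
    return (p[0] & p[1]) ^ (p[0] & q[1]) ^ (q[0] & p[1])


def zero_share(pa, pb):
    """Per-bit zero share [xa + xb, xb, xa], as quoted in the text."""
    return (tuple(i ^ j for i, j in zip(pa, pb)), pb, pa)


def restricted_sharing(xa, xb, xc, ya, yb, yc):
    """Returns (additional output encoding zero, shared AND result)."""
    guard_out = zero_share(xa, xb)
    result = (
        and_component(xb, xc) ^ yb,  # ASSUMPTION: mask assignment
        and_component(xc, xa) ^ yc,
        and_component(xa, xb) ^ ya,
    )
    return guard_out, result


def histogram():
    """Output counts over uniform x shares and uniform zero-encoding y shares."""
    zero_masks = [(p, q, p ^ q) for p, q in product(BITS, repeat=2)]
    hist = Counter()
    for xa, xb, xc in product(PAIRS, repeat=3):
        for ya, yb, yc in zero_masks:
            hist[restricted_sharing(xa, xb, xc, ya, yb, yc)] += 1
    return hist


def is_uniform(hist):
    """Three of the four unshared inputs give AND = 0 and one gives AND = 1,
    so a uniform sharing hits each output share 3 or 1 times accordingly."""
    preimages = {0: 3, 1: 1}
    for (guard_out, result), count in hist.items():
        if any(a ^ b ^ c for a, b, c in zip(*guard_out)):
            return False  # the additional output must encode zero
        if count != preimages[result[0] ^ result[1] ^ result[2]]:
            return False
    return len(hist) == 16 * 8  # every admissible output share is reached
```

The additional output is again a uniformly distributed share of zero, so it can serve as the additional input of the next sharing, which is the sense in which no fresh randomness is consumed.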

Security Claim
The proposed method is about constructing a uniform sharing. Therefore, the constructed sharings are secure up to the 1st-order probing model with glitches, similarly to the original TI and the original changing of the guards. This is because uniformity is not enough for resistance against high-order attacks. Its extension to either 2-share schemes or high-order schemes is non-trivial and left open for future research.

Related works
There are conventional works on recycling randomness in high-order masking schemes [FPS17, Pap18]. The original changing of the guards technique, as well as the proposed generalization, differs from these works in that it focuses on constructing a uniform sharing. In other words, in the original and generalized changing of the guards techniques, randomness is treated similarly to other inputs, and recycling is a part of a uniform sharing.

The Changing of the Guards Revisited
The proposed method is a generalization of the changing of the guards technique.Assume that a target mapping ψ has the same input and output size i.e., n = m.We consider connecting the restricted mapping ψ R in cascade as shown in Fig 5-(left).The mapping is written as Note that a glitch should be considered when non-linear layers are connected.However, this is not a problem for the connection In other words, the above discussion provides yet another proof of the uniformity of the changing of the guards sharing.However, it implies more because ψ is not necessarily invertible in the new proof.Furthermore, the proposed method can be used for a target mapping ψ having different input and output sizes.Its concrete example is given in Sect.4.2.

Application to Canright's S-box Implementation
A 3-share threshold implementation of Canright's S-box implementation, discussed in Sect. 2.4, is designed based on the proposed technique. As we are interested in the non-linear functions, we ignore the linear mappings in the following discussion.

1st and 2nd Stages
The 2nd stage shown in Fig. 2 is considered.The same discussion applies to the 1st stage for GF (2 4 ).The stage can be expressed as Note that f is a mapping for obtaining ζ in Eq. (7).x and y are preserved for a later stage.There is no uniform sharing for a mapping having an output larger than an input [WM18].Now, we consider an extended mapping of f given by [x, y, z] As f has an algebraic degree of 2, there is a 3-share correct and non-complete sharing {f a , f b , f c }. Based on Theorem 1, a uniform sharing of f E can be constructed.

Definition 8 (Sharing of f E
is given by Note that there is no additional output because both [X a , X b , X c ] and [Y a , Y b , Y c ] should be preserved for later use.Therefore, an additional input [z a , z b , z c ] satisfying z a + z b + z c = 0 should be supplied from outside so that the stage works correctly.As discussed in the next section, there are additional outputs obtained in the 3rd and 4th stages.By using them as additional inputs, the 1st and 2nd stages can be executed without using fresh randomness.

3rd and 4th Stages
The 3rd stage shown in Fig. 2 is considered. The same discussion applies to the 4th stage for GF(2^4). In this stage, $\zeta^{-1} b$ and $\zeta^{-1} a$ are obtained given $a$, $b$, and $\zeta$ (see Eq. (6)). Transforming from $\zeta$ to $\zeta^{-1}$ is linear and thus easy to implement, as discussed in Eq. (8). Therefore, we consider the remaining part.
The stage is represented by a mapping h given by As g has an algebraic degree of 2, there is a correct and non-complete sharing {g a , g b , g c }.
However, $\{g_a, g_b, g_c\}$ cannot be uniform because it is a 2-input multiplication over GF(2^2) [NRS11]. Therefore, the technique discussed in Sect. 3.2 is considered. Note that this is an example of the generalized changing of the guards for a target function having different input and output sizes (n > m). The extended and restricted mappings of h are given by [x, y, z, v, w] is given by
[Figure: diagram with the labels "Additional inputs for the Changing of the Guards" and "Additional outputs"]
Fig. 7 shows a detailed diagram of the sharing in Definition 10. The sharing in Definition 10 is uniform by construction, as discussed in Sect. 3.2. It can also be understood in comparison with the original changing of the guards. The upper half of Fig. 7 is a straightforward application of the changing of the guards except that there is no guard for $Z^i_a$, $Z^i_b$, and $Z^i_c$. Therefore, we can show that the upper half is invertible in the same manner as the original work [Dae17]. The lower half is $\{\psi^\perp_a, \psi^\perp_b, \psi^\perp_c\}$ in Definition 5 and thus preserves uniformity.
Corollary 2. Definition 10 is a correct, non-complete, and uniform TI of the following mapping: Notably, the unbalance between input and output sizes results in additional outputs They can be used in the 1st and 2nd stages as additional inputs.

Putting it Together
Fig. 8 shows the threshold implementation of the AES S-box based on the proposed method.The datapath width is 42 bits: 3 × 8 = 24 bits for a 3-share 8-bit AES state, 3 × 4 bits for an additional input in GF (2 4 ), and 3 × 2 bits for an additional input in GF (2 2 ).At the final stage, a share representing an S-box output is obtained along with 3 × (4 + 2) = 18-bit additional outputs.As discussed previously, the additional output is forwarded to the next S-box calculation as an additional input.Therefore, fresh randomness is required only for the first additional input and is not required during execution.Note that there are two ways of remasking in the 3rd and 4th stages.One way is ] in Definition 10.They are forwarded from a previous cycle using the temporary registers in between the 2nd/3rd and 3rd/4th stages.Another way is the additional output [Z i a , Z i b , Z i c ] in Definition 10.They are carried to the end of the pipeline and then used in another S-box calculation in the next AES round.
Here, we show that the proposed S-box implementation satisfies non-completeness.Table 3 shows the relationship between the intermediate values and the inputs.The rows represent the intermediate values: , and (ii) the outputs from the non-linear functions {t i a , t i b , t i c }.The columns represent the input shares to the S-box implementation: {x a , x b , x c }, {y a , y b , y c }, {z a , z b , z c }, {v a , v b , v c }, and {w a , w b , w c }.Note that {v a , v b , v c } and {w a , w b , w c } are the masks for the changing of the guards.The intermediate values and inputs are also indicated in Fig. 8. ♦ or ♠ in the table shows that the intermediate value depends on the input.In particular, ♠ shows that the intermediate value is either (i) masked by the input share or (ii) the input itself.
In the proposed design, the outputs from the non-linear functions are immediately refreshed by adding some of the input shares. More specifically, for any $i$, the stage input/output shares $\{X^i_a, X^i_b, X^i_c\}$, $\{Y^i_a, Y^i_b, Y^i_c\}$, and $\{Z^i_a, Z^i_b, Z^i_c\}$ have distinct masks, i.e., they have ♠ in different columns of Table 3. Therefore, in each stage, there is no information leakage unless all three elements of an input share of the stage are combined. However, this condition is never satisfied because each stage satisfies non-completeness. Consequently, the proposed S-box implementation satisfies non-completeness. The non-completeness ensures the security of the proposed implementation in the presence of glitches. More specifically, in Fig. 8, the same color is assigned to the intermediate values having the same mask. Using Fig. 8, we can verify that all the stage inputs/outputs have distinct masks.

Circuit Architecture
Fig. 9 shows the circuit for AES encryption based on the proposed S-box implementation. The design is based on the one by Moradi et al. [MPL+11] that uses the state and key arrays as basic construction blocks. Three arrays are used to store 3-share representations of the AES state and key. Note that three key arrays are used to protect key scheduling as well. As previously discussed, the additional output is forwarded to the next AES round. To carry the additional outputs, the data width of the state and key arrays is extended from 8 to 14 bits. Accordingly, each array has 14 × 16 = 224 bits. Thus, 224 × 6 = 1,344 bits of registers are needed in total for storing the shared representations of the state and key. The AES circuit works similarly to the original design [MPL+11]. An exception is MixColumns, in which only the state elements are processed while the additional inputs are unchanged.
To convert a message into its 3-share representation, we need 128 × 2 + 6 × 16 × 2 = 448 bits of randomness. To convert a key into its 3-share representation, on the other hand, we need 128 × 2 + 6 × 4 × 2 = 304 bits of randomness because there are only four S-boxes in key scheduling. Also, a 24-bit random number is needed to initialize the temporary registers in between the 2nd/3rd and 3rd/4th stages. As a result, 776 initial random bits are needed for single AES processing.
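The register and randomness budget above follows from a handful of products, which the sketch below recomputes. The interpretation that the factor 2 counts the two randomly drawn shares out of three is an assumption made here; the constant and function names are hypothetical.

```python
BLOCK_BITS = 128           # AES state or key size
GUARD_BITS_PER_SBOX = 6    # 4-bit and 2-bit additional inputs per S-box
RANDOM_SHARES = 2          # ASSUMPTION: two of the three shares are drawn at random
STAGE_REGISTER_BITS = 24   # temporary registers between the stages, as quoted
ARRAY_WORD_BITS = 14       # 8 data bits plus 6 additional-input bits
ARRAY_WORDS = 16
ARRAY_COUNT = 6            # three state arrays and three key arrays


def sharing_randomness(n_sboxes: int) -> int:
    """Random bits needed to share one 128-bit block with its additional inputs."""
    return BLOCK_BITS * RANDOM_SHARES + GUARD_BITS_PER_SBOX * n_sboxes * RANDOM_SHARES


def total_initial_randomness() -> int:
    """Message (16 S-boxes), key (4 S-boxes), and the stage registers."""
    return sharing_randomness(16) + sharing_randomness(4) + STAGE_REGISTER_BITS


def total_register_bits() -> int:
    """Bits held in the state and key arrays."""
    return ARRAY_WORD_BITS * ARRAY_WORDS * ARRAY_COUNT


if __name__ == "__main__":
    print("message sharing:", sharing_randomness(16))
    print("key sharing:    ", sharing_randomness(4))
    print("total random:   ", total_initial_randomness())
    print("register bits:  ", total_register_bits())
```

All of this randomness is consumed once at initialization; none is drawn while the rounds execute.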
The key and plaintext are fed to the circuit in the 3-share representation with additional inputs added. The I/O ports have a 42-bit width, similar to the datapath. Therefore, feeding the key and plaintext requires 16 cycles. The AES round is executed in 25 cycles: 16 cycles for S-boxes, four cycles for MixColumns, one cycle for ShiftRows, and four additional cycles for pipeline latency. Single AES encryption requires 266 cycles in total. The design is implemented in HDL and synthesized using the NanGate 45-nm standard cell library [Nan] with Synopsys Design Compiler. Table 1 shows the performance evaluation. Note that the implementations in Table 1 are evaluated using different standard cell libraries. We should consider the difference between the libraries when comparing results.
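The cycle count can be reproduced as follows. The decomposition into 16 loading cycles plus ten rounds of 25 cycles each is an inference that happens to match the quoted total of 266; it is not spelled out in the text, and the names are hypothetical.

```python
LOAD_CYCLES = 16   # feeding key and plaintext over the 42-bit port
ROUNDS = 10        # AES-128
ROUND_BREAKDOWN = {
    "sbox": 16,
    "mixcolumns": 4,
    "shiftrows": 1,
    "pipeline_latency": 4,
}


def cycles_per_round() -> int:
    """Sum of the per-round contributions listed in the text."""
    return sum(ROUND_BREAKDOWN.values())


def total_cycles() -> int:
    """ASSUMPTION: every one of the ten rounds costs the same 25 cycles."""
    return LOAD_CYCLES + ROUNDS * cycles_per_round()


if __name__ == "__main__":
    print("cycles per round:", cycles_per_round())
    print("total cycles:    ", total_cycles())
```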

Performance Evaluation and Comparison
The proposed S-box circuit uses 3.5 [kGE], which is comparable to that of conventional works. However, the size of the AES circuit is 17.1 [kGE], which is much larger than that of conventional designs. This is explained by a large number of registers. As shown in the circuit-area breakdown in Table 4, the state and key arrays occupy 76% (13.0 [kGE]) of the total circuit area. As discussed earlier, in the proposed design, the data are stored in the 3-share form along with the additional inputs. Consequently, 1,344-bit registers are required in the state and key arrays. In comparison, the designs [BGN+15, GMK17, CRB+16] store only two shares and thus, 8 × 16 × 2 × 2 = 512 bits are required. Furthermore, the designs [BGN+15, CRB+16, WM18] use unprotected key scheduling, and thus, only 384 bits are required.
Despite the large circuit area, the proposed design has an advantage over the conventional design by Wegener et al. [WM18]. The latency of the proposed design is 266 cycles, which is smaller than 1/10 that of the conventional design. As Wegener et al. traded latency for a compact circuit area, it would be fair to compare these designs in terms of the area-latency product. As shown in Table 1, the cost of the proposed design is smaller than 1/4 that of the conventional design even if the designs are compared in terms of the area-latency product.
The design has room for further improvement. Since there are only four S-boxes in key scheduling, (16 − 4) × 6 × 3 = 216 bits of the registers in the key arrays are wasted. The waste comes from the restriction of keeping the design as close as possible to the original one [MPL+11]. By redesigning the key array, we can save 216 bits or 1.5 [kGE]. Moreover, some of the conventional works use unprotected key scheduling [UHA17, WM18]. Under such a design policy, we can save roughly 5.0 [kGE] by removing the two key arrays and some additional inputs.

Simulation-based Leakage Assessment
The security of the AES circuit is evaluated using a simulation-based leakage assessment. The post-synthesis simulation with back annotation is performed using the Cadence NC-Verilog logic simulator at a precision of 1 ps. The circuit is operated with a clock interval longer than the critical path delay. During the simulation, the number of 0 → 1 output transitions in all the standard cells is measured for each cycle. The measured data are used as an approximation of the dynamic current consumption measured at V_DD [TV05]. Thus, they are used as a simulated power trace. As switching in each standard cell is considered, the simulation captures glitches.
The test vector leakage assessment (TVLA) [BCD+13] is conducted using the simulated power traces. For each of the fixed and variable test vectors, 100,000 simulated power traces are obtained. Subsequently, the two sets of traces are compared using the t-statistic. For comparison, the target is operated under two different conditions: (i) fully functional and (ii) randomness disabled. For the second case, the random numbers for creating a share and additional inputs are all zero. In addition, the two mask registers in the 3rd and 4th stages are set to zero.
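The comparison step can be sketched with Welch's t-statistic computed independently at every sample point of the traces. The code below is a minimal illustration using only the Python standard library; the threshold of 4.5 is the value conventionally used with TVLA and is not stated in the text above, and the function names and the tiny synthetic data set are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic between two samples with unequal variances."""
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )


def t_trace(fixed_traces, random_traces):
    """t-statistic per sample point; each trace is a list of equal length."""
    return [
        welch_t(f_column, r_column)
        for f_column, r_column in zip(zip(*fixed_traces), zip(*random_traces))
    ]


def leaks(t_values, threshold=4.5):
    """ASSUMPTION: the conventional TVLA threshold of 4.5 is applied."""
    return any(abs(t) > threshold for t in t_values)


if __name__ == "__main__":
    # Synthetic toggle counts: column 0 is identically distributed in both
    # sets, column 1 differs in its mean and should be flagged.
    fixed = [[10 + i % 4, 20 + i % 3] for i in range(120)]
    rand = [[10 + (i + 1) % 4, 25 + i % 3] for i in range(120)]
    t = t_trace(fixed, rand)
    print("t-values:", t)
    print("leakage detected:", leaks(t))
```

In the setting described above, each trace would hold the per-cycle transition counts of one simulated encryption, and the test would be run once with randomness enabled and once with it disabled.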

Conclusion
In this paper, we discussed how to construct a uniform sharing of a target mapping having different input and output sizes.We introduced two techniques namely extension and restriction.In extension, a target mapping is transformed in such a way that the extended mapping has a uniform sharing.However, it requires an additional input representing zero i.e., x a + x b + x c = 0.In restriction, the additional output obtained as a side effect of extension is transformed into a share representing 0. Subsequently, the zero share is reused as the additional input in the next sharing.By combining extension and restriction, sharing is realized without remasking.The proposed method is a generalization of the changing of the guards technique [Dae17].By applying the above methods to the Canright's AES S-box implementation, the first 3-share TI of the AES S-box without using remasking is obtained.
As shown in Table 1, the proposed AES design is larger than the conventional designs. Optimizing its performance is an important future research direction. As discussed in Sect. 4.4.2, the presented design has room for further optimization. Moreover, there is a possibility of sharing the additional inputs between consecutive S-box calculations, but the security under such optimizations remains open. Evaluating the proposed design with real measurements is another important future research direction.
Similarly, the distributions of the shared inputs and outputs are denoted as $P_I(x, y)$ and $P_O(X, Y)$, respectively. As the input is distributed uniformly, we have First, we consider the case $X \neq 0$, i.e., $X_a + X_b + X_c \neq 0$. These outputs are prohibited by construction and thus, The remaining case $X = 0$, i.e., $X_a + X_b + X_c = 0$, is considered. In this case, where $\mathcal{X}$ is the set of possible values that $X$ can take. The probability of observing an output share $[X_a, X_b, X_c]$ is solely determined by From Eq. (15) and (18), we obtain P O (X, Y ) = P O (X,Y ) α for all the cases. Therefore, the sharing is uniform according to Eq. (3).

Proof of Theorem 1
Proof. The sharing $\{\psi^R_a, \psi^R_b, \psi^R_c\}$ is correct because The sharing is non-complete because $\psi^R_a$, $\psi^R_b$, and $\psi^R_c$ are independent of $[x_a, \tilde{y}_a]$, $[x_b, \tilde{y}_b]$, and $[x_c, \tilde{y}_c]$, respectively. $\{\psi^E_a, \psi^E_b, \psi^E_c\}$ and $\{\psi^\perp_a, \psi^\perp_b, \psi^\perp_c\}$ are uniform according to Lemma 1 and 2, respectively. Therefore, $\{\psi^R_a, \psi^R_b, \psi^R_c\}$, being a composition of $\{\psi^E_a, \psi^E_b, \psi^E_c\}$ and $\{\psi^\perp_a, \psi^\perp_b, \psi^\perp_c\}$, is uniform.
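The three properties used in these proofs (correctness, non-completeness, and uniformity) can be checked exhaustively for small functions. The sketch below is an illustration only: it uses the well-known direct 3-share sharing of a 1-bit AND rather than the sharings defined in this paper, and all function names are hypothetical. It shows why a plain multiplication needs additional treatment: the sharing is correct and non-complete, but not uniform.

```python
from collections import Counter
from itertools import product


def shared_and(xs, ys):
    """Direct 3-share sharing of z = x AND y (illustrative, not the paper's)."""
    xa, xb, xc = xs
    ya, yb, yc = ys
    za = (xb & yb) ^ (xb & yc) ^ (xc & yb)  # independent of share a
    zb = (xc & yc) ^ (xc & ya) ^ (xa & yc)  # independent of share b
    zc = (xa & ya) ^ (xa & yb) ^ (xb & ya)  # independent of share c
    return za, zb, zc


def sharings(v):
    """All 3-share XOR sharings (a, b, c) of the bit v, i.e. a ^ b ^ c == v."""
    return [(a, b, a ^ b ^ v) for a, b in product((0, 1), repeat=2)]


def is_correct():
    """The output shares must always XOR to x AND y."""
    for x, y in product((0, 1), repeat=2):
        for xs, ys in product(sharings(x), sharings(y)):
            za, zb, zc = shared_and(xs, ys)
            if (za ^ zb ^ zc) != (x & y):
                return False
    return True


def output_counts(x, y):
    """How often each output sharing occurs over all input sharings of (x, y)."""
    return Counter(
        shared_and(xs, ys) for xs, ys in product(sharings(x), sharings(y))
    )


def is_uniform():
    """Uniform: for every (x, y), all valid output sharings are equally likely."""
    for x, y in product((0, 1), repeat=2):
        counts = output_counts(x, y)
        if len({counts[s] for s in sharings(x & y)}) != 1:
            return False
    return True
```

For the unshared input (0, 0), the output sharing (0, 0, 0) occurs for 7 of the 16 input sharings, while each of the other three valid output sharings occurs 3 times, so `is_uniform()` returns `False` although `is_correct()` returns `True`.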

Proof of Corollary 1
Proof. As the sharing $\{f^E_a, f^E_b, f^E_c\}$ is a sharing as specified in Definition 3, it is a correct, non-complete, and uniform sharing according to Lemma 1.

Proof of Corollary 2
Proof. The sharing is correct because The sharing is non-complete because, for any $i, j \in [1, L]$, $\{X^i_a, Y^i_a, Z^i_a\}$, $\{X^i_b, Y^i_b, Z^i_b\}$, and $\{X^i_c, Y^i_c, Z^i_c\}$ are independent of $\{x^j_a, y^j_a, z^j_a\}$, $\{x^j_b, y^j_b, z^j_b\}$, and $\{x^j_c, y^j_c, z^j_c\}$, respectively. The sharing is uniform by construction.

Figure 1: Changing of the Guards sharing

Figure 3: Construction of invertible functions from functions

Fig. 3-(b) shows the Feistel network, which is also a technique to obtain an invertible function (i.e., permutation) from a non-invertible F-function. The original Feistel network requires an F-function having the same input and output sizes. Fig. 3-(c) and -(d) show a generalization called the unbalanced Feistel network [SK96]. The unbalanced Feistel network enables the construction of an invertible function from an F-function having different input and output sizes. Let $\psi : \{0, 1\}^n \to \{0, 1\}^m$ be a target function having a 3-share correct and non-complete sharing. $\psi$ is extended to an invertible mapping in the same manner as the Toffoli gate and Feistel network.

Definition 2 (Extended mapping). Let $x, X \in \{0, 1\}^n$ and $y, Y \in \{0, 1\}^m$. The extended mapping [x, y]

Fig. 4-(left) shows a diagram for $\psi^E$. We further define a sharing of $\psi^E$ denoted by $\{\psi^E_a, \psi^E_b, \psi^E_c\}$.
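The idea of turning a non-invertible function into an invertible one, in the manner of the Toffoli gate and the Feistel network, can be illustrated with a small sketch. The concrete form used here, $(x, y) \mapsto (x, y \oplus \psi(x))$, is an assumption made for illustration and is not quoted from the paper's definition; the toy target `psi` and the helper names are likewise hypothetical.

```python
from itertools import product


def psi(x):
    """Toy non-invertible target: 2-bit input -> 1-bit output (AND of the bits)."""
    return (x >> 1) & x & 1


def psi_ext(x, y):
    """Assumed Feistel/Toffoli-style extension: keep x, XOR psi(x) into y."""
    return x, y ^ psi(x)


def is_permutation(n_bits, m_bits):
    """Check by enumeration that psi_ext is a bijection on (n + m)-bit inputs."""
    domain = list(product(range(1 << n_bits), range(1 << m_bits)))
    images = {psi_ext(x, y) for x, y in domain}
    return len(images) == len(domain)
```

With this assumed form, setting the additional input to `y = 0` yields `psi(x)` on the additional output, and applying `psi_ext` twice returns the original pair, which is one way to see that the extended mapping is invertible even though `psi` itself is not.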

Figure 4: Extension and restriction of a mapping (left) and the corresponding sharing (right)

Fig. 4-(left) shows a diagram for $\psi^R$. A sharing of the restricted mapping is considered.

Definition 7 (A sharing of the restricted mapping). A sharing $\{\psi^R_a, \psi^R_b, \psi^R_c\}$ s.t.
we obtain a sharing as shown in Fig. 5-(right). By construction, the sharing in Fig. 5-(right) is uniform.

Figure 7: Changing of the guards sharing of h R (Definition 10)

Figure 8: Proposed 3-share TI of the AES S-box. Edges are colored based on Table 3.

Figure 9: AES circuit using the proposed S-box implementation

Figure 10: Results of the simulation-based leakage assessment. Horizontal: T-statistics, vertical: cycle. Subgraphs (i) and (ii) correspond to two different operating conditions.

Fig. 10 (i) and (ii) show the traces of T-statistics. The horizontal and vertical axes represent the cycle and T-statistics, respectively. In the fully functional case in Fig. 10-(i), the obtained T-statistics fit within the range [−4.5, 4.5]. In Fig. 10-(ii), the T-statistics fall far outside this range. The results show that the target AES circuit passes the leakage assessment when it is fully functional.

Table 1: Performance evaluation of the proposed AES implementations and comparison with the performance of conventional designs with 1st order security. The entries with † are based on [WM18].
Nikova et al. proposed an MPC-based countermeasure called threshold implementation (TI)

Table 3: Data propagation in the proposed S-box implementation. Row: intermediate values, column: inputs. The names of intermediate values and inputs follow Fig. 8. ♦/♠ is placed if the intermediate result depends on the input. ♠ is placed if the intermediate result is masked by the input.

Table 4: Circuit-area breakdown of the proposed AES circuit