Multiplicative Masking for AES in Hardware

Hardware masked AES designs usually rely on Boolean masking and perform the computation of the S-box using the tower-field decomposition. On the other hand, splitting sensitive variables in a multiplicative way is more amenable to the computation of the AES S-box, as noted by Akkar and Giraud. However, multiplicative masking needs to be implemented carefully so as not to be vulnerable to first-order DPA with a zero-value power model. Up to now, sound higher-order multiplicative masking schemes have been implemented only in software. In this work, we demonstrate the first hardware implementation of AES using multiplicative masks. The method is tailored to be secure even if the underlying gates are not ideal and glitches occur in the circuit. We detail the design process of first- and second-order secure AES-128 cores, which result in the smallest die area to date among state-of-the-art masked AES implementations with comparable randomness cost and latency. The first- and second-order masked implementations improve on these designs by 29% and 18% respectively. We deploy our construction on a Spartan-6 FPGA and perform a side-channel evaluation. No leakage is detected with up to 50 million traces for either our first- or second-order implementation. For the latter, this holds for both univariate and bivariate analysis.


Introduction
Cryptographic primitives are designed to resist mathematical attacks such as linear or differential cryptanalysis. The designer typically assumes a classic adversarial model, where encryption is treated as a black box, only revealing inputs and outputs to adversaries. When these primitives are deployed in embedded devices, unintended signals such as the instantaneous power consumption or electromagnetic radiation leak sensitive information, effectively turning the black box into a gray box. Side-channel analysis is a cheap and scalable technique that allows the adversary to exploit these signals and extract secret keys or passwords. Hence, cryptography deployed in embedded devices needs not only mathematical but also physical security.
One particularly powerful attack, differential power analysis (DPA), was introduced in 1999 by Kocher et al. [KJJ99]. In this type of attack, the adversary feeds different plaintexts to an encryption algorithm using the same key and extracts sensitive information from the collected power traces. Today, we aim at providing security against d-th order DPA. In a d-th order DPA attack, the adversary exploits any statistical moment of the power consumption up to order d. Since, given sufficient noise, statistical moments become exponentially harder to estimate as the order d grows (both in the number of samples and in computational time), a moderate security target d = 1, 2 often suffices in practice, especially in conjunction with complementary countermeasures [HOM06, CCD00].
In a side-channel secure implementation, the goal is to make the leakages of the values handled in the implementation independent of the sensitive inputs and sensitive intermediate variables. At the architectural level this is typically achieved by masking, which means the processed data is probabilistically split into multiple shares in such a way that one can only recover the sensitive data if all of its shares are known. Recovering secrets from shares becomes exponentially more difficult as noise increases, since it corresponds to estimating higher-order statistical moments at increasing noise levels [CJRR99, GP99].
Previous Work. The earliest masking schemes [GP99, Tri03, ISW03] were shown to be unsuitable for hardware implementations by Mangard et al. [MPG05, MPO05]. The vulnerability arises when unintended transitions of a signal, or glitches, occur, caused by non-idealities such as logic gates with non-zero propagation delays or routing imbalances. The glitches problem can be addressed at many levels: by equalizing signal paths (which normally requires manual access to low-level routing details and a careful characterization of the logic library), by adding synchronization elements (such as registers or signal gating), or by using a masking scheme that is inherently secure under glitches. Extensive research has been done on countermeasures based on secret sharing and multiparty computation that are provably secure even in the presence of glitches. The prevailing schemes are those of Prouff and Roche [PR11] and Threshold Implementations (TI) by Nikova et al. [NRS11], which use polynomial and Boolean masking respectively. The latter was extended to higher-order security by Bilgin et al. (higher-order TI) [BGN+14a]. The similarities and differences between TI and the Private Circuits scheme [ISW03], which provides provable security if the circuit behaves ideally (no glitches), were analysed by Reparaz et al. A more recent scheme is Domain Oriented Masking [GMK16], which is also related to the original Private Circuits scheme [ISW03], with additional registers against glitches and a different randomness consumption. These masking schemes have all been applied to Canright's tower-field AES S-box [Can05] due to its small footprint and structure, resulting in a multitude of masked AES implementations [MPL+11, BGN+14b, CRB+16, GMK17, UHA17]. Those of Ueno et al. [UHA17], De Cnudde et al. [CRB+16] and Gross et al. [GMK17] are the smallest to date, with the latter requiring much less randomness.
In this paper we follow a different avenue. We do not apply Boolean masking to Canright's tower-field decomposition, but instead, we revisit the well-known concept of switching between different types of masking. Boolean masking schemes are compatible with linear operations but difficult to work out for non-linear functions. Akkar and Giraud [AG01] were the first to propose an adaptive masking scheme for AES at CHES 2001. The idea is to use Boolean masks for the affine operations and multiplicatively masked values for multiplications (or in the case of AES, inversion) and convert between the two types when necessary. At CHES 2002, Golić and Tymen [GT02] presented an inherent weakness of multiplicative masking, namely that it is vulnerable to first-order DPA because the zero element cannot be effectively masked multiplicatively. As a solution to this zero problem, they proposed to map each zero element to a non-zero element. The adaptive masking scheme was studied in depth and extended to higher-order security by Genelle et al. [GPQ11b]. So far, it has only been used in software implementations.
Our Contribution. We present the first hardware implementation of an adaptively masked AES. We describe glitch-resistant modules that convert between Boolean and multiplicative masking and that attend to the zero problem, based on the algorithmic descriptions provided for software in [GPQ10, GPQ11a, GPQ11b]. While this work focuses on the AES S-box, the methodology can be used to mask any inverse or power map-based S-box [AGR+16].
We optimize the number of inversions used and the randomness cost for first-order and second-order resistant AES, which both achieve a smaller area than the current state-of-the-art masked hardware AES implementations of [CRB+16] and [GMK17], while having comparable randomness and latency requirements. We formally discuss the security of our S-box and its components up to the level current state-of-the-art tools and methods allow. We also deploy our implementations on an FPGA for side-channel evaluation using a non-specific leakage assessment test to analyse practical security in a lab environment with low noise. No leakage is detected with up to 50 million traces, confirming that the security claims hold empirically.

Preliminaries
Notation. Multiplication and addition in the field F_q = GF(2^k) are denoted by ⊗ and ⊕ respectively. We use & for multiplication in the field GF(2) (i.e. the AND operation). For ease of notation, we sometimes omit ⊗ and &. Square brackets [•] in formulas indicate where synchronization via registers or memory elements is used. An element r ∈ F_q drawn uniformly at random from F_q is denoted r ←$ F_q.

Adversarial Model
We consider a physical adversary model, in which an attacker can probe and observe up to d intermediate wires in each time period. This model is known as the d-probing model [ISW03]. To account for non-ideal (glitchy) circuits, we assume that any probed wire carrying a function output also leaks information about all function inputs up to the last register [RBN+15]. It has been shown in [FRR+10, RP10, DDF14] that security in the d-probing model implies security against d-th order DPA as well, given the independent leakage assumption: each share and its corresponding logic leak independently of the others.

Boolean and Multiplicative Masking
A popular countermeasure against d-th order DPA is masking sensitive values by probabilistically splitting them into d + 1 shares. Let ∘ be some group operation. Then for any x ∈ F_q we process the sharing x = (s_0, . . ., s_d) with s_0 ∘ s_1 ∘ . . . ∘ s_d = x.
Masked representations. We can distinguish different masked representations based on the splitting operation ∘. A common choice is the exclusive-or operation ⊕, resulting in a Boolean sharing. We use b_i (or b_i^x) to denote Boolean shares of x, i.e. x = b_0 ⊕ b_1 ⊕ . . . ⊕ b_d.
In this paper we also use multiplicative sharing, which in a side-channel context is typically defined with shares (p_0, . . ., p_d) such that x = p_0^{-1} ⊗ . . . ⊗ p_{d−1}^{-1} ⊗ p_d. We refer to this sharing as a type-I multiplicative sharing. We further define a type-II multiplicative sharing with shares (q_0, . . ., q_d) such that x = q_0 ⊗ q_1 ⊗ . . . ⊗ q_d. This notation is more common in secret sharing. We omit the superscript x when it is clear from context.

Masked operations.
In Boolean masking, linear operations can trivially be applied locally on each share: for a linear function L, a sharing of L(x) is obtained as (L(b_0), . . ., L(b_d)). Non-linear operations such as a multiplication, on the other hand, are less straightforward and much more costly to implement. The opposite situation arises if one uses multiplicative masking. In that case, linear operations are non-trivial but multiplication is local: a type-II sharing of z = x ⊗ y is obtained share-wise as q_i^z = q_i^x ⊗ q_i^y. Finding an efficient but glitch-resistant way to process Boolean shares in a non-linear operation has been a hot topic in recent years. A natural strategy is to switch back and forth between masked representations and perform each operation in its most compatible setting.
The zero-value problem. The fundamental security flaw of multiplicative masking was first pointed out by Golić and Tymen [GT02]. Multiplicative masking cannot securely encode the value 0. The mean power consumption of a single share p_i^x reveals whether the underlying secret is zero or non-zero, since the distribution of p_i^x depends on whether x = 0 for any share index i (for instance, x = 0 forces the masked share p_d = 0 in a type-I sharing). This means that for any number of shares, the original multiplicative masking scheme is vulnerable to first-order DPA.
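A small simulation makes the flaw visible. We assume a hypothetical Hamming-weight leakage of a single share of a first-order type-I sharing; gf_mul is a GF(2^8) multiplication helper.

```python
import secrets

def gf_mul(a, b):
    """GF(2^8) multiplication modulo the AES polynomial."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return p

def masked_share_of(x):
    """Masked share p_1 = r * x of a first-order type-I sharing (p_0, p_1) = (r, r*x)."""
    r = 1 + secrets.randbelow(255)  # non-zero multiplicative mask
    return gf_mul(r, x)

def mean_hamming_weight(x, n=4000):
    """Average Hamming weight of the masked share over n fresh maskings."""
    return sum(bin(masked_share_of(x)).count("1") for _ in range(n)) / n

# x = 0 forces the masked share to 0 in every masking, while any non-zero
# secret yields a uniformly random non-zero share (mean weight near 4):
# a first-order distinguisher for the zero value.
```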

Masking in Hardware
Masking in hardware requires special care. The seminal work of Mangard et al. [MPG05, MPO05] showed that glitches can reveal sensitive information in hardware masked implementations that were otherwise expected to be secure.

Non-completeness.
The concept of non-completeness appears in the work of Nikova et al. [NRS11] and follow-up works on higher-order security [BGN+14a, RBN+15]. Non-completeness between register stages has become a fundamental property for constructing provably secure hardware implementations even if the underlying logic gates glitch. We recall here the definition of non-completeness: for any shared implementation f operating on a shared input x, d-th order non-completeness is satisfied if any combination of up to d shares of f is independent of at least one input share.

Masked Multiplier. Reparaz et al. [RBN+15] showed that a d-th order masked multiplication in hardware can be constructed using only d + 1 shares if the sharings of the inputs are independent (so as to not break non-completeness). One approach to do this is detailed in [GMK16] and is referred to as Domain Oriented Masking (DOM).

Our work uses as a masked AND gate the DOM-indep multiplier from [GMK16]. Let x = (b_0^x, b_1^x) and y = (b_0^y, b_1^y) be first-order Boolean sharings of bits x and y. A sharing of the multiplication result z = x & y is obtained by first calculating four partial products t_ij = b_i^x & b_j^y, i, j ∈ {0, 1} as in [ISW03]. When i ≠ j, t_ij is called a cross-domain term and must be refreshed with a randomly drawn bit r ←$ GF(2). After a register stage for synchronization, the shares (b_0^z, b_1^z) are computed as

b_0^z = t_00 ⊕ [t_01 ⊕ r],   b_1^z = t_11 ⊕ [t_10 ⊕ r].   (1)

The second-order multiplier uses three bits of randomness r ←$ (GF(2))^3. The inputs and outputs have three shares and there are nine partial products t_ij.
Note that we employ the special version of the DOM-indep multiplier where only the cross-domain terms are synchronized in registers. For efficiency, these registers are clocked on the negative edge, as is done in [GSM17]. This is illustrated for the first-order multiplier in Figure 1.
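Functionally, the first-order DOM-indep multiplier can be sketched as follows. This is a behavioral model: the bracketed register stage is reflected only in the grouping of operations, not in actual timing.

```python
import secrets

def dom_and(x, y):
    """First-order DOM-indep AND on 2-share bit inputs x = (x0, x1), y = (y0, y1).
    Cross-domain terms t01, t10 are refreshed with r and, in hardware,
    registered before the final XOR."""
    x0, x1 = x
    y0, y1 = y
    r = secrets.randbelow(2)
    z0 = (x0 & y0) ^ ((x0 & y1) ^ r)   # t00 ^ [t01 ^ r]
    z1 = (x1 & y1) ^ ((x1 & y0) ^ r)   # t11 ^ [t10 ^ r]
    return z0, z1

def share_bit(b):
    """Split a bit into two Boolean shares."""
    m = secrets.randbelow(2)
    return m, b ^ m

# Exhaustive functional check: the shared output always recombines to x & y.
for x in (0, 1):
    for y in (0, 1):
        z0, z1 = dom_and(share_bit(x), share_bit(y))
        assert z0 ^ z1 == (x & y)
```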

Design of an Adaptively Masked AES S-box
The AES S-box is an inversion in GF(2^8), followed by an affine transformation over bits. We adopt the idea of adaptive masking, where we use Boolean sharings for linear operations and multiplicative masks for non-linear operations. We thus implement the inversion by first converting the input from Boolean to multiplicative masking. The inversion then becomes a local operation on the individual shares: inverting each share of a type-I sharing (p_0, . . ., p_d) of x yields a type-I sharing (p_0^{-1}, . . ., p_d^{-1}) of x^{-1}. In what follows, we first describe the conversion circuits between Boolean and multiplicative masking. We address the zero problem in § 3.3. An overview of the S-box can be found in Figure 5. While this section is written with AES in mind, the methodology can be applied to any S-box constructed from inversion or another power map in F_q.

Masking Conversions
Following the strategy of [GPQ11b], we intuitively describe a higher-order conversion between Boolean and multiplicative shares with the following steps. Note that this description is not final and we will deviate from it slightly in § 3.2.
For k = 1, . . ., d:

(a) Expansion: extend the sharing x with a new share of the target masking type. The number of target shares is augmented by one and the total number of shares is now d + 2.

(b) Synchronization: store the shares in a register.

(c) Compression: remove one share from the source sharing by partially unmasking. The number of source shares shrinks by one and the total number of shares is again d + 1.
Boolean to Multiplicative. More specifically, consider a conversion from Boolean to type-I multiplicative shares. After k iterations of the above steps, we have an intermediate sharing

x = p_0^{-1} ⊗ . . . ⊗ p_{k−1}^{-1} ⊗ (b_k ⊕ . . . ⊕ b_d).

The number of target (multiplicative) shares is k and the number of source (Boolean) shares is d + 1 − k. In the expansion phase, we add a new multiplicative share by drawing a random p_k ←$ F_q* and multiplying it with all Boolean shares:

b_i ← p_k ⊗ b_i, for i = k, . . ., d.   (2)

We now obtain a d + 2 sharing

x = p_0^{-1} ⊗ . . . ⊗ p_k^{-1} ⊗ (b_k ⊕ . . . ⊕ b_d).

In the compression phase, one Boolean share is removed by partial unmasking:

b_{k+1} ← b_k ⊕ b_{k+1}.   (3)

Multiplicative to Boolean. The opposite conversion, from a type-II multiplicative sharing x = q_0 ⊗ . . . ⊗ q_d to Boolean shares, proceeds analogously: in the expansion phase a fresh additive share is drawn, and in the compression phase, multiplicative share q_{d−k} is removed by multiplication with all Boolean shares, b_i ← q_{d−k} ⊗ b_i. After k iterations we obtain a sharing with k + 1 target (Boolean) shares and d − k source (multiplicative) shares.
We provide high-level descriptions for both conversions in pseudocode below (Algorithms 1 and 2). These pseudocodes are slightly different from the higher-order generalizations in [GPQ11b] but representative of their first- and second-order descriptions.

Algorithm 1: Boolean to Multiplicative conversion. Algorithm 2: Multiplicative to Boolean conversion.
Conversions in Hardware: Dealing with glitches. The register stage between the expansion and compression phases is necessary because of the presence of glitches in hardware circuits. Without this register, the non-completeness of the conversion is broken and we have no security guarantees. Consider for example equations (2) and (3). Together, they compute

b_{k+1} ← [p_k ⊗ b_k] ⊕ [p_k ⊗ b_{k+1}].

Without a register, the signal p_k might arrive late to the multiplication. As a result, two of the shares of x are combined on one wire b_k ⊕ b_{k+1} and the security is reduced by one order.

Specific Inversion Circuits
Why we use two types of multiplicative masking. Consider a type-I multiplicative masking, i.e. x = p_0^{-1} ⊗ . . . ⊗ p_{d−1}^{-1} ⊗ p_d.
To obtain a type-I masking of its inverse x^{-1}, we can locally invert all shares p_i using d + 1 unshared F_q inverters. Converting back to Boolean masking then requires d more F_q inverters. However, the following formula shows that we can do the entire masked inversion with only one unshared F_q inverter:

x^{-1} = (p_0^{-1} ⊗ . . . ⊗ p_{d−1}^{-1} ⊗ p_d)^{-1} = p_0 ⊗ . . . ⊗ p_{d−1} ⊗ p_d^{-1}.

Indeed, by only locally inverting the last share p_d of a type-I multiplicative masking of x, we obtain a type-II multiplicative sharing (q_0, . . ., q_d) = (p_0, . . ., p_{d−1}, p_d^{-1}) of its inverse x^{-1}. Note that regardless of the security order d, only one unshared inverter is required this way.
We now look in more detail at the first-and second-order implementations of the conversions.
First-order. The complete first-order masked inversion, including the resulting circuits for first-order conversions between Boolean and multiplicative masking, is shown in Figure 2. The left side of the figure converts a Boolean sharing x = (b_0, b_1) to a type-I multiplicative sharing (p_0, p_1) such that x = p_0^{-1} ⊗ p_1. With a non-zero r_0 ←$ F_q*, the multiplicative shares are calculated as

p_0 = r_0,   p_1 = [r_0 ⊗ b_0] ⊕ [r_0 ⊗ b_1].

The right side of the circuit converts a type-II multiplicative masking (q_0, q_1) of x^{-1} into a Boolean masking. This requires another random r_1 ←$ F_q:

b_0' = q_0 ⊗ [q_1 ⊕ r_1],   b_1' = q_0 ⊗ r_1.

These procedures are identical to those described in Algorithms 1 and 2.
Figure 2: First-order shared implementation of an inversion in F q .The dashed lines depict registers.
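The first-order datapath of Figure 2 can be modelled functionally in Python. This is a behavioral sketch that ignores register timing; gf_mul and gf_inv are assumed GF(2^8) helpers with 0^{-1} = 0, and the input x is assumed non-zero (the zero case is handled in § 3.3).

```python
import secrets

def gf_mul(a, b):
    """GF(2^8) multiplication modulo the AES polynomial."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return p

def gf_inv(a):
    """a^254: the single unshared inverter of the design."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def bool_to_mult(b0, b1):
    """Boolean (b0, b1) -> type-I (p0, p1): p0 = r0, p1 = [r0*b0] ^ [r0*b1]."""
    r0 = 1 + secrets.randbelow(255)  # r0 drawn from F_q*
    return r0, gf_mul(r0, b0) ^ gf_mul(r0, b1)

def local_inversion(p0, p1):
    """Invert only the last share: type-I of x becomes type-II of x^-1."""
    return p0, gf_inv(p1)

def mult_to_bool(q0, q1):
    """Type-II (q0, q1) -> Boolean: b0' = q0*[q1 ^ r1], b1' = q0*r1."""
    r1 = secrets.randbelow(256)
    return gf_mul(q0, q1 ^ r1), gf_mul(q0, r1)

def masked_inversion(b0, b1):
    """Full shared inversion of a non-zero x = b0 ^ b1, one inverter in total."""
    return mult_to_bool(*local_inversion(*bool_to_mult(b0, b1)))
```

For any non-zero x, the two output shares XOR to x^{-1}, which mirrors the single-inverter argument of § 3.2.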
Second-order. Adopting the same algorithms for d + 1 = 3 shares does not provide second-order secure conversions (see Appendix A). We require an extra refreshing of additive shares. Figure 3 depicts our circuit for the second-order shared inversion in F_q.
The conversion from a Boolean to a type-I multiplicative sharing is depicted on the left side of the figure. The conversion requires three units of randomness: r_0, r_1 ←$ F_q* and the extra refreshing u ←$ F_q. The multiplicative shares are computed as shown in Figure 3. For the opposite conversion (shown on the right side of Figure 3), we start from a type-II multiplicative masking. This means we only need to invert the last share, p_2. The Boolean shares of x^{-1} are likewise calculated as depicted in the figure. The conversion again uses three units of randomness, r_2, r_3, u ←$ F_q, although we can recycle the refreshing mask u from the Boolean to multiplicative conversion. Each conversion thus uses only 2.5 units of randomness.
Figure 3: Second-order shared implementation of an inversion in F q .The dashed lines depict registers.
Our procedures differ slightly from those of Genelle et al. [GPQ11b], especially in the smaller use of randomness (we expand on this in Appendix A). For a general randomness strategy for higher-order conversions, we refer to [GPQ11b], but we note that their randomness cost is not necessarily optimal for each target security order d. A custom approach can result in a lower cost.

The Zero Problem
We now describe how to circumvent the zero problem of multiplicative masking. Both in MPC literature [DK10] and in software masking [GPQ10], it has been proposed to map each zero element in F_q to a non-zero element in F_q* using a Kronecker delta function before converting to multiplicative masks.
In the AES S-box, we need to do an inversion in F_q. Both the zero and unit element of F_q are their own inverses: 0^{-1} = 0 (by the AES convention) and 1^{-1} = 1. It is therefore sufficient to replace each zero element by a "one" before the inversion and change it back afterwards. Consider the Kronecker delta function

δ(x) = 1 if x = 0, and δ(x) = 0 otherwise.

We can write the inversion of any x ∈ F_q as follows:

x^{-1} = (x ⊕ δ(x))^{-1} ⊕ δ(x).

We thus require a circuit that computes a shared Kronecker delta function δ(x). Its output (a sharing of "zero" or a sharing of "one") is to be added to the input of the conversion from Boolean to multiplicative masking and to the output of the conversion from multiplicative to Boolean masking (see Figure 5). This way, any zero element goes through the F_q inversion as a "one" and is thus never shared multiplicatively.
The Kronecker delta function δ(x) can be calculated with an n-input AND, or equivalently, a log_2(n)-level tree of 2-input ANDs, with the inverted bits of x as input:

δ(x) = x̄_{n−1} & x̄_{n−2} & . . . & x̄_0.

The circuit is shown for n = 8 in Figure 4, with x_i a sharing of the i-th bit of x. In software, it has been realized using masked table lookups [GPQ10] and bit-slicing [GPQ11a]. We implement each AND gate with a DOM-indep multiplier [GMK16]. We denote by r_j the randomness needed for each gate. As each multiplier requires one register stage, the entire circuit of Figure 4 takes three clock cycles (regardless of the number of shares). We note that a trade-off can be made here between latency and area. It is possible to reduce the depth of the tree (and thus the number of clock cycles) at the cost of a larger fan-in for the AND gates, which results in a considerable increase in area for shared implementations. In this paper, we choose to work only with 2-input AND gates in order to minimize circuit area.
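A behavioral sketch of the shared δ(x) tree (first-order, DOM-AND gates); complementing exactly one share of each bit realizes the inversion of the bits on a sharing. This naive model draws one fresh bit per gate, i.e. 7 in total.

```python
import secrets

def dom_and(x, y):
    """First-order DOM-indep AND on 2-share bits (fresh randomness per gate)."""
    x0, x1 = x
    y0, y1 = y
    r = secrets.randbelow(2)
    return (x0 & y0) ^ ((x0 & y1) ^ r), (x1 & y1) ^ ((x1 & y0) ^ r)

def shared_delta(s0, s1):
    """delta(x) on a 2-share byte (s0, s1): a 3-level tree of 7 DOM-AND gates
    over the complemented bits of x; the output is a 2-share bit."""
    # NOT on a sharing: complement exactly one share of each bit.
    bits = [(((s0 >> i) & 1) ^ 1, (s1 >> i) & 1) for i in range(8)]
    while len(bits) > 1:  # 8 -> 4 -> 2 -> 1
        bits = [dom_and(bits[i], bits[i + 1]) for i in range(0, len(bits), 2)]
    return bits[0]
```

Recombining the two output shares yields 1 exactly when the unshared input byte is zero.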
First-order optimizations. In a straightforward first-order secure implementation of δ(x), each input bit has two shares and each DOM-AND gate requires one extra random bit r_j ←$ GF(2). The circuit thus receives a total of 23 bits (16 input share bits and 7 random bits). That is a lot of entropy for a function that outputs only 2 bits. In order to bring down the randomness cost of the circuit, we decide to recycle some of the bits across the multiplication gates. A theoretical framework for this was presented in [FPS17]. Following this would result in a total randomness cost of 5 units: one bit in each of the three layers and one bit each for the refreshing after layer 1 and after layer 2. We now push the cost even further by using custom optimizations.
We rewrite the DOM equations (1) and note that they have a special property:

b_0^z = t_00 ⊕ [t_01 ⊕ r] = b_0^x & y ⊕ r,   b_1^z = t_11 ⊕ [t_10 ⊕ r] = b_1^x & y ⊕ r.

The DOM gate thus uses its inputs somewhat asymmetrically, since the output shares depend only on the unmasked second input y and not on its sharing. This means that any randomness that has been used to mask y before arriving at this gate disappears from its output sharing z. Hence, we can reuse this randomness in the next layer. In our case, we use the more significant bit (depicted as the lower input to an AND gate in Fig. 4) as the "second input" and we conclude that the second layer of the Kronecker implementation removes any dependence of the data on r_2 and r_4. In contrast, reusing r_1 (or r_3) in layer two is not advisable. Moreover, for a first-order implementation (only univariate leakage matters), the upper and lower two gates in the first layer have independent inputs and outputs, and can therefore use the same randomness as long as layer two does not.
We propose such a cross-layer and cross-gate reuse of the random bits. We are thus able to reduce the randomness consumption of the first-order Kronecker delta implementation from 7 to only 3 bits. We refer to Appendix C for the probability distributions of intermediate and output wires of this circuit with our randomness optimization. We verified that these probability distributions are independent of the secret input. Moreover, we note that these probability distributions are the same as in the circuit without randomness optimization.
Second-order optimizations. A second-order implementation uses three bits of randomness per multiplication: r_j = (r_j0, r_j1, r_j2) ←$ (GF(2))^3. Again, instead of consuming 21 bits of extra randomness in the circuit, we propose a recycling of the bits. Following the framework of [FPS17] would require five groups of three fresh random bits, i.e. 15 bits. Our customization is more restricted in the higher-order case because of the possibility of multivariate leakage. We still have the special composability property of the DOM gates, but the gates in the first layer can no longer be considered independent. With the resulting reuse, we reduce the randomness consumption of the second-order Kronecker delta implementation from 21 to 13 bits. The probability distributions of relevant (pairs of) wires can again be found in Appendix C.

The S-box
We summarize the AES S-box circuit in Figure 5. The local inversion is based on the smallest unshared AES S-box implementation to date by Boyar, Matthews and Peralta [BMP13]. More details on our adaptation of this circuit are given in Appendix B. The registers are depicted with grey dotted lines. In a first-order implementation each conversion has a latency of one cycle, whereas in a second-order implementation it is two clock cycles. The S-box input needs to be fed to the δ(x) circuit three clock cycles before the first conversion. This could cost us three cycles of S-box latency as well as three stages of 8 × (d + 1)-bit registers. Instead, we reorganize the state array and key schedule such that the Kronecker delta function can be precomputed. We describe this in the next section.

AES Architecture and Control
The ShiftRows, MixColumns and AddRoundKey stages in AES are all linear and thus trivially masked by instantiating d + 1 copies, one for each share of the state and key schedule. Following previous masked AES implementations, we use a byte-serialized architecture with a pipelined S-box as shown in Figure 5. Note that instead of the serialized architecture from [MPL+11], we use a similar architecture to that of [GMK16, Fig. 5] since it exhibits a more compact and efficient datapath. We adapt [GMK16] to accommodate our S-box, which needs a three-cycle precomputation of the Kronecker delta function.

State Array
The byte-serialized architecture from [GMK16] is very efficient in terms of clock cycles, since it performs the MixColumns, ShiftRows and AddRoundKey operations in parallel to SubBytes. Figure 6 (left) shows the state array with its normal meandering movement during the SubBytes operation in black full lines and the ShiftRows functionality in blue dotted lines. The column of registers that is the input of the MixColumns operation is indicated by a red striped frame, whereas the registers receiving the output of MixColumns one cycle later are specified by a full red frame.
The S-box input is taken from State 00, while the Kronecker delta computation starts three cycles beforehand on State 30. In order to have State 30 ready for the Kronecker function, we have to put the MixColumns operation in the second column (instead of the first column as in [GMK16]). ShiftRows is performed when the sixteenth and last S-box output enters the state. We also adapt the ShiftRows connections such that all bytes end up one column to the right of the actual ShiftRows result. This means that the normally first column is the first MixColumns input (state bytes 01, 11, 21, 31) and the normally last column now occupies state bytes 00, 10, 20 and 30. During the next four clock cycles, we rotate the state by returning byte 00 to the state input (33) untouched. After those four cycles, the state columns are restored to their correct order and the first S-box input is ready. The key array is handled similarly: the key byte fed through the S-box returns to the key state as Key 33.

Control
We now go into more detail on the scheduling of the 24 clock cycles (0 to 23) that make up one encryption round when the S-box latency is four cycles (as in our second-order implementation). Table 1 details the control of the register movement and Table 2 shows how various inputs to the states and the S-box change. The 16 bytes of the state register are fed to the S-box in cycles 3 to 18 of each round of encryption. This means the Kronecker delta function receives the same 16 bytes three cycles before that: in cycles 0 to 15. During these cycles, the key state follows its meandering movement and Key 00 is used to construct the Round Key byte. In the remaining clock cycles (from cycle 16 until cycle 23), the key array is rotating. The last column of the array is fed through the Kronecker delta function in cycles 17 to 20 and through the S-box in cycles 20 to 23, which means their outputs are ready for the first four Round Key calculations four cycles later: in cycles 0 to 3.
The state receives its S-box outputs in cycles 7 to 22. In the last cycle (22), the sixteenth and last S-box output enters the state and ShiftRows is performed. The first round of encryption (loading of the inputs) starts in cycle 0 with the data and key inputs replacing respectively State 30 and the Round Key. In total, one AES encryption takes 10 × 24 + 16 = 256 cycles. Our first-order AES implementation has the same latency in spite of the S-box requiring only two cycles. Given the AES design, it is difficult to exploit an S-box latency below four cycles.

Security Evaluation
In this section, we elaborate on the security of the first- and second-order AES constructions against a probing adversary in the presence of glitches. Neither formal proofs in a particular security model nor empirical leakage detection tools can on their own provide full evidence of security. A security evaluation is incomplete without complementary analyses following both methodologies. Therefore, our approach consists of three stages: first, in § 5.1, we address the security of the S-box under the ideal circuit assumption using the notion of strong non-interference [BBD+16, BBP+16]. Next, in § 5.2, we evaluate the security of the S-box in the presence of glitches, using leakage detection tools available in the literature. Finally, in § 5.3, we complete the evaluation by analyzing the whole circuit on a physical device.

Security of the S-box in a theoretical framework
We now use the concept of Strong Non-Interference (SNI) [BBD+16] to prove that the S-box construction is theoretically secure. We use the same methodology as the proofs of [BBD+16]. We further define S_i as the set of shares that are required at the input of block A_i in order to be able to simulate the probes in the remainder of the circuit, i.e. ⋃_{j=1}^{i} I_j ∪ O. We subsequently treat this set as a set of probes that needs to be simulated using input shares from a previous block A_{i−1}. This way, we gradually move towards the input and show that the number of input shares of x required to simulate all probes stays within the SNI bound. Consider for example block A_4 in Table 3. This block has output z and input y. The set of shares of z, S_3, is constrained by |S_3| ≤ |I_1| + |I_2| + |I_3| + |O| ≤ d. Since A_4 is d-SNI, we have that the number of shares of y required to simulate S_3 ∪ I_4 is at most |I_4|. We call this set of shares S_4. Now, since we are able to simulate S_3 using S_4 and since S_3 is able to simulate the remaining probes ⋃_{i=1}^{3} I_i ∪ O, we know that the set of shares S_4 is sufficient to simulate ⋃_{i=1}^{4} I_i ∪ O. Table 3 shows that we need at most d shares of the input to simulate all d-tuples of probes in the circuit. This proves that the S-box is d-SNI.

Practical Evaluation of Glitch Security of the S-box
A useful property for the synthesis of secure circuits in the presence of glitches is non-completeness [NRS11]. We use the VerMI tool described in [ANR17] to verify the security of the gadgets that create the S-box, i.e. the conversions and the Kronecker delta. This tool was designed specifically for masked hardware implementations. In particular, it can verify if a circuit satisfies the non-completeness property from register to register. By applying this tool directly to the RTL HDL descriptions of our gadgets, we confirm that each stage is non-complete and therefore secure in the univariate setting in the presence of glitches, provided the shared input does not have a secret-dependent bias. We verify this condition on the input sharing independently (Appendix C).
We note that it has been pointed out that verifying glitch security and strong non-interference separately does not guarantee composability in a glitchy environment [FGP+17]. In section 5.1, we have given security proofs for the S-box as best as we could with the tools at our disposal. In this section, we consider glitches. The combined theoretical verification of "glitchy" SNI is an interesting direction for future research. However, note that SNI is not a necessary condition for the S-box to be secure. We further evaluate the security of the entire S-box using state-of-the-art tools.
We use the simulation tool of [Rep16], in which we exhaustively probe the S-box and create power traces using an identity leakage model. These traces contain not only explicit intermediates (stabilized values on wires) but also values that could be observed in a glitch (transient values on wires). We exhaustively probe the S-box in this way in a completely noiseless setting and create up to 100 million simulated traces. For more details, we refer to [Rep16]. We detect no univariate leakage with up to 100 million traces, nor, in the case of our second-order gadgets, bivariate leakage. We draw the same conclusions when using the tool described in [DBR18]. This tool essentially exhausts every possible glitch in the computation by verifying that there is no mutual information between the secret and all possible (pairs of) glitch-extended probes.
While the theoretical possibility of a very weak bias remains, we would need more than 100 million traces to detect it, and thus its practical implications are thin: if the leak is not even detected with 100 million traces in a noiseless scenario, it would take considerably more traces to exploit it (i.e. perform key recovery) in a realistic noisy scenario.

Physical Evaluation
After evaluating the S-box both theoretically and empirically in simulation, we finally put our entire AES design to the test in a physical environment.
Setup. We program a Xilinx Spartan-6 FPGA with both our first- and second-order designs on a SAKURA-G board, which is specifically designed for side-channel evaluation. For synthesis, we use the Xilinx ISE option KEEP_HIERARCHY to prevent optimization across modules (and in particular across shares). To minimize platform noise, we split the implementation over a crypto FPGA, which handles the AES encryption, and a control FPGA, which communicates with the host computer and supplies masked data to the crypto FPGA. The FPGAs are clocked at 3.072 MHz and sampled at 1 GS/s.
The crypto FPGA is also equipped with a PRNG to generate the randomness required in every clock cycle. This PRNG is loaded with a fresh seed for every encryption. In contrast to other state-of-the-art masked implementations, we must be able to generate one or two nonzero bytes for the multiplicative masks. We refer to Appendix F for a description of how we achieve this in practice without stalling the pipeline.
Univariate. We perform a non-specific leakage detection test [BCD + 13] using the methodology from [RGV17]. This means we gather power traces in two sets: the first corresponding to encryptions of a fixed plaintext and the other to encryptions of random plaintexts. We choose the fixed plaintext equal to the key in order to test the special case of zero inputs to the S-box in the first round. Nonzero S-box inputs then occur in encryption round two and are thus naturally also tested. The two sets of measurements are compared using the t-test statistic. When the t-statistic at order d crosses the threshold T = ±4.5, the null hypothesis "the design has no d-th-order leakage" is rejected with confidence > 99.999%. On the other hand, when the t-statistic remains below this threshold, we corroborate that side-channel information is not distinguishable at order d. The results for our first-order design are shown in Figure 8. Each trace consists of 64 clock cycles, comprising about two and a half rounds of encryption. An example of such a trace is shown in Figure 8, top. To verify the soundness of our setup, we first perform the leakage detection test with the PRNG turned off (i.e. an unmasked implementation). This is shown in the left column of the figure and, as expected, the design presents severe leakage with only 12 000 traces. On the right side, we perform the leakage detection test with the PRNG turned on. We do not observe evidence of first-order leakage with up to 50 million power traces. The design does leak in the second order, as anticipated.
Similarly, we show the test results for our second-order design in Figure 9. The leakage when the PRNG is turned off (left column) is clear. The masked implementation (right column) presents evidence of neither first- nor second-order leakage with up to 50 million power traces. While we would expect the third-order t-statistic to surpass the threshold, this is not yet the case due to platform noise. We also track the evolution of the maximum absolute t-test value as a function of the number of traces taken. This is shown in Figure 10 for the first-order (left) and second-order (right) protected AES implementations. On the left, we clearly see an increase in the absolute t-values of the second- and third-order moments, while the first-order statistic is stable. For our second-order implementation, the noise of the platform prevents us from seeing evidence of third-order leakage.
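The fixed-vs-random methodology above can be sketched in a few lines. This is a simplified illustration, not the evaluation code of [RGV17]: traces here are synthetic Gaussian data, and higher-order preprocessing is reduced to per-sample centering/standardization followed by exponentiation.

```python
# Minimal sketch of a non-specific (fixed-vs-random) Welch t-test, TVLA style.
# Synthetic data only; real evaluations follow the moment-based preprocessing
# of the cited methodology more carefully.
import numpy as np

def welch_t(set_a, set_b):
    """Pointwise Welch's t-statistic between two trace sets (traces x samples)."""
    m_a, m_b = set_a.mean(0), set_b.mean(0)
    v_a, v_b = set_a.var(0, ddof=1), set_b.var(0, ddof=1)
    return (m_a - m_b) / np.sqrt(v_a / len(set_a) + v_b / len(set_b))

def preprocess(traces, d):
    """Center (and standardize for d >= 3) per sample, then raise to the d-th
    power so a d-th-order difference shows up as a difference of means."""
    if d == 1:
        return traces
    c = traces - traces.mean(0)
    if d >= 3:
        c = c / traces.std(0)
    return c ** d

rng = np.random.default_rng(0)
fixed = rng.normal(0.0, 1.0, (5000, 100))    # "fixed plaintext" set
random_ = rng.normal(0.0, 1.0, (5000, 100))  # "random plaintext" set
leaky = rng.normal(0.5, 1.0, (5000, 100))    # a set with a first-order difference

print(np.abs(welch_t(leaky, random_)).max() > 4.5)  # True: leak crosses T = 4.5
t2 = welch_t(preprocess(fixed, 2), preprocess(random_, 2))
print(t2.shape)                                     # (100,)
```

With 5 000 traces per set, a mean shift of 0.5 against unit noise yields |t| far above the ±4.5 threshold at every sample, while the two identically distributed sets give no detection.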

Bivariate.
In order to perform a bivariate leakage detection test, we reduce the length of the power traces to 15 clock cycles and the sample rate of the oscilloscope to 200 MS/s. Each trace then consists of 1 000 time samples. To improve the signal-to-noise ratio, we make the traces DC-free. We then combine the measurements at different time samples by taking the outer product of the centered traces with themselves. The resulting symmetric matrices are the samples for our t-test.
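The preprocessing described above can be sketched as follows; array shapes and the centering order are a minimal reading of the text, not our measurement code.

```python
# Sketch of the bivariate combining step: remove DC per trace, center per
# time sample, then form the outer product of each centered trace with itself.
import numpy as np

def bivariate_samples(traces):
    traces = traces - traces.mean(axis=1, keepdims=True)  # DC-free per trace
    centered = traces - traces.mean(axis=0)               # center per sample
    # one symmetric (samples x samples) matrix per trace
    return np.einsum('ti,tj->tij', centered, centered)

traces = np.random.default_rng(1).normal(size=(8, 5))     # 8 toy traces, 5 samples
combined = bivariate_samples(traces)
print(combined.shape)   # (8, 5, 5)
```

Entry (i, j) of each matrix is the centered product of time samples i and j, which is exactly the second-order combining function whose means the t-test then compares.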
We first perform this experiment on the first-order protected AES implementation to verify that we can indeed detect bivariate leakage. The resulting t-statistic after 1 and 45 million traces is shown in Figure 11 and confirms that our method is sound.
Next, we do the same for the second-order masked AES implementation. We collect 50 million traces and show the resulting t-statistic in Figure 12. The result clearly shows that no bivariate leakage can be detected with 50 million traces.

Implementation Cost
We presented first- and second-order secure constructions for AES and evaluated their security. In this section, we investigate the implementation cost and compare it to the state-of-the-art AES designs of [CRB + 16] and [GMK17]. All area measures were obtained with the Synopsys Design Compiler v. 2013.12, using the Open Cell Nangate 45nm library [NAN], and are expressed in 2-input NAND gate equivalents. We use compile option -exact_map to prevent optimization across modules. For a fair comparison, we also synthesize the implementations of [CRB + 16] and [GMK17] with the same library and toolchain. From the latter, we picked the options for smallest area, i.e. not perfectly interleaved and with the eight-stage S-box. Both of these works create a shared implementation from Canright's compact AES S-box [Can05] using the tower-field method; our approach is thus radically different. We cannot easily compare with [UHA17] because of different synthesis libraries, though it seems to have a similar area footprint for a larger randomness requirement (64 bits per S-box). Also, they only provide a first-order implementation. We first detail the cost of the S-box only in § 6.1 and then look at the entire AES encryption in § 6.2.

S-box

Table 4 shows our implementation results for the S-box. Our S-box implementations are the smallest to date among state-of-the-art schemes with similar randomness and latency, with an area reduction of 29% for first order and 18% for second order.

AES
Table 5 shows the implementation results of our entire AES implementations in comparison with those of De Cnudde et al. [CRB + 16] and Gross et al. [GMK17]. Our S-box area reduction results in an overall improvement of around 10% over the state of the art, with comparable or even better randomness consumption and latency.

Conclusion
We have ported the well-known concept of adaptively masking ciphers such as AES to hardware. The idea has been extensively studied in software but had not been applied in hardware until now. We show that this methodology is a very competitive alternative to state-of-the-art masked AES designs. Our approach is conceptually simple, yet incorporates modern countermeasures to mitigate the effect of glitches in hardware. Specifically, we present secure circuits for converting between Boolean and multiplicative masking and for circumventing the well-known zero problem of multiplicative masking. We apply the methodology to the AES cipher for first- and second-order security and show experimentally that our implementations do not exhibit univariate or multivariate leakage with up to 50 million traces. Our AES S-box implementations require randomness and latency comparable to state-of-the-art implementations and yet achieve an 18 to 29% smaller chip area. We believe this is an interesting addition to the hardware designer's toolbox.

A On optimized second-order conversions
When adopting the conversion procedures described in § 3.1 for d = 2, an additional Boolean refreshing u is required to obtain second-order security (see Figure 3). Genelle et al. propose mask conversion procedures tailored to software implementations that aim at providing higher-order security [GPQ11b]. The conversions require a number of additive refreshing masks: (d−1)d/2 units for Boolean to multiplicative and d(d+1)/2 for multiplicative to Boolean. The authors suggest that one can omit these extra refreshings when d = 2 and still maintain second-order security [GPQ11b, p. 246], both for Boolean to multiplicative and vice versa. Here, we show that these "optimized" variants exhibit second-order leaks, and thus additional randomness is needed to achieve second-order security.

A.1 Boolean to Multiplicative
Following the basic recipe for converting three Boolean shares to multiplicative shares results in the circuit in Figure 13. The same conversion was initially proposed by Genelle et al.
Consider the two intermediate values V_1 and V_2 (marked by the red stars in Figure 13). We will see that the pair (V_1, V_2) jointly leaks information on the sensitive input value x in the second statistical order.

B Inversion circuit
The AES S-box circuit from Boyar, Matthews and Peralta [BMP13] is the smallest to date, even beating Canright's tower-field one. The circuit consists of three parts: S = B • F • U ⊕ 0x63, with U, B linear and F non-linear. As we are only interested in the inversion part of the S-box, we adopt only F and U and add our own linear layer to obtain the inversion output x_0^-1, x_1^-1, . . ., x_7^-1. We provide only the linear equations of the new block here; for F and U we refer to [BMP13, Fig. 10 and 11].

In Figure 15 we show again the AND tree that implements the shared Kronecker delta function with the randomness optimizations from § 3.3, and we indicate with red dotted lines the stages where we place our probes. At each probe, we compute the probability distribution of the wire for each possible value of the secret x and verify that the distribution does not vary with the secret. For the second-order implementation, we do the same for each pair of probes. We distinguish A stages, in which we target the cross products t_ij of the DOM multipliers, and B stages, which contain the multiplication results. Note that the A-stage probes are the cross products before any randomness is added.
One Probe. If we look only at individual probes (first-order) in either the first- or second-order implementation, we find that all B-stage wires are uniformly distributed for each secret. For each of the cross products in the A stages, we find a non-uniform distribution […]. Since that distribution is the outer product of […] with itself, such a pair of probes is statistically independent. In contrast, let i ≠ j, j ≠ k and i ≠ k; then when we probe two cross products (t_ij, t_ik) or (t_ij, t_kj) in the same multiplier, we obtain the probability distribution […]
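The single-probe claim for A-stage cross products can be reproduced exhaustively on a toy scale. The sketch below checks one GF(2) cross product t01 = a0 & b1 of a first-order DOM-style multiplier (share indexing is our illustrative convention): its distribution is non-uniform but identical for every pair of secrets.

```python
# Exhaustive check of a single A-stage probe: the distribution of a DOM
# cross product t01 = a0 & b1 (over GF(2), before any randomness is added)
# is non-uniform yet independent of the secrets.
from itertools import product
from collections import Counter

def cross_product_dist(xa, xb):
    """Distribution of t01 = a0 & b1 over all uniform sharings of xa and xb."""
    dist = Counter()
    for a0, b0 in product((0, 1), repeat=2):
        b1 = b0 ^ xb   # second share of b fixed by the secret xb
        # xa only fixes a1 = a0 ^ xa, which t01 never sees
        dist[a0 & b1] += 1
    return dist

dists = [cross_product_dist(xa, xb) for xa, xb in product((0, 1), repeat=2)]
print(all(d == dists[0] for d in dists))  # True: secret-independent
print(dists[0])                           # Counter({0: 3, 1: 1}): non-uniform
```

The probe outputs 1 only when both involved shares are 1, hence the 3-to-1 bias; since each share is individually uniform, the bias carries no information about either secret.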

D Strong Non-Interference of Conversions
In this section, we prove the strong non-interference of the conversions between Boolean and multiplicative masking. We cannot use the tool of [Cor17], since it is incompatible with our multiplicative operations. An important substitution rule from [Cor18] is that an XOR with a random r_i ←$ F_q serves as a one-time pad when r_i is not used in another part of the probe. Extending this substitution rule to field multiplication is not straightforward: in general, the multiplication of a secret field element x ∈ F_q with a random variable r_i cannot be simulated by r_i, because of the non-uniform mapping of zeroes in a multiplication. However, if at least one of the multiplicands is nonzero, the random value does play the role of a one-time pad, and we define and use a corresponding substitution rule. It can be seen that for any field multiplication in our setting, at least one of the operands is nonzero, so the rule applies. In what follows, we show how to simulate all d-probes in the conversion circuits using only |I| input shares, where I is the set of intermediate probes. We thus show that the conversions are d-SNI for d ∈ {1, 2}. Table 6 shows the proof for d = 1 (Figure 16) and Tables 7 and 8 for d = 2 (Figure 17). For readability, we do not attempt to simulate when the probe(s) themselves already depend on only |I| input shares.
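The multiplication substitution rule can be checked exhaustively in GF(2^8); the sketch below (AES polynomial 0x11b, example operand chosen arbitrarily) confirms that multiplication by a fixed nonzero element permutes the field, so the product of a uniform value with a nonzero operand is itself uniform, while a zero operand collapses it.

```python
# Empirical check of the one-time-pad substitution rule for field
# multiplication in GF(2^8): nonzero operand => bijection, zero => collapse.

def gf_mul(a, b, poly=0x11b):
    """Carry-less multiply modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

nonzero_x = 0x53                                      # arbitrary nonzero operand
images = {gf_mul(r, nonzero_x) for r in range(256)}
print(len(images) == 256)                             # True: r*x is uniform for uniform r
print({gf_mul(r, 0) for r in range(256)} == {0})      # True: zero operand breaks the rule
```

This is exactly why the conversions must keep one operand of every multiplication nonzero: only then can the product be replaced by a fresh uniform value in the simulation.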


F Nonzero Randomness
Our first-order masked AES requires 19 bits of fresh randomness for each S-box calculation. For this purpose, we instantiate an implementation of the stream cipher Trivium [Can06], which provides 19 bits in parallel each clock cycle. Of these 19 bits, one byte serves as a new multiplicative mask r_0 and must therefore be nonzero. The probability that we end up with an unusable mask is 2^-8. Since the S-box is used 200 times per encryption (10 rounds with 16 state bytes and 4 key bytes each), we (over)estimate this event to happen roughly once per encryption. We do not want to stall the pipeline until the PRNG generates a nonzero byte. Recall from Table 2 that the S-box receives an input in only 20 out of 24 clock cycles. This means that there are four cycles in each encryption round during which we are generating but not using 19 bits of randomness. This is more than enough to create a set of backup nonzero bytes in, for example, a FIFO. The size of the FIFO should depend on how many zero bytes we expect to see in one encryption round. Naturally, bytes are verified to be nonzero before being put in the FIFO.
We can model the number of PRNG failures X (i.e. the number of zero bytes) over n = 20 trials with a binomial distribution with probability p = 2^-8. The expected number of failures is then simply E[X] = np ≈ 0.078. A FIFO depth of only two or three bytes should thus more than suffice.
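The binomial sizing argument above can be checked directly; `overflow_prob` is an illustrative helper, not part of the design.

```python
# Back-of-the-envelope check of the FIFO sizing: zero mask bytes per round
# are modeled as X ~ Binomial(n = 20, p = 2^-8).
from math import comb

n, p = 20, 2 ** -8
expected = n * p                     # E[X] = np
print(expected)                      # 0.078125

def overflow_prob(depth):
    """P(X > depth): probability a FIFO holding `depth` backup nonzero bytes
    runs dry within a single encryption round."""
    return 1 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(depth + 1))

print(overflow_prob(3) < 1e-5)       # True: a three-byte FIFO almost never fails
```

Even a depth of three backup bytes fails with probability on the order of 10^-6 per round, consistent with the "two or three bytes suffice" estimate.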
A similar approach can be used for the second-order implementation, in which 53 bits of randomness are required each cycle, of which two bytes must be nonzero.

x = (p_0, . . ., p_k, b_k, . . ., b_d), where x = p_0^-1 ⊗ · · · ⊗ p_k^-1 ⊗ (b_k ⊕ · · · ⊕ b_d). In the compression phase, we remove Boolean share b_k by adding it to another Boolean share b_(k+1):

b_(k+1) = b_k ⊕ b_(k+1)    (3)

which brings us to a (d+1)-sharing x = (p_0, . . ., p_k, b_(k+1), b_(k+2), . . ., b_d) with k+1 target (multiplicative) shares and d−k source (Boolean) shares. After d iterations, the sharing has been converted to x = (p_0, . . ., p_(d−1), b_d) such that x = p_0^-1 ⊗ · · · ⊗ p_(d−1)^-1 ⊗ b_d, which is equivalent to a type-I multiplicative sharing of x with p_d = b_d.

Multiplicative to Boolean. For the opposite conversion from multiplicative to Boolean shares, we consider a type-II multiplicative sharing; the procedure for type-I is identical, apart from d additional inversions. Note that the first iteration starts with k = 1 and b_d = q_d. In iteration k, we have the intermediate sharing x = (q_0, . . ., q_(d−k), b_(d−k+1), . . ., b_(d−1), b_d) with k target (Boolean) shares and d+1−k source (multiplicative) shares. In the expansion phase, a new Boolean share b_(d−k) is added by splitting b_d into b_d ⊕ b_(d−k), with b_(d−k) randomly drawn. The d + 2 shares of x are then
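The Boolean-to-multiplicative iteration can be exercised with a functional sketch in GF(2^8). This is an unmasked reference model only: registers, glitch hardening and the exact refresh schedule of the paper are omitted, and the share bookkeeping below is our reading of the description, not the paper's circuit.

```python
# Functional sketch of the expansion/compression iteration that converts a
# (d+1)-share Boolean sharing into a type-I multiplicative sharing.
import secrets

def gf_mul(a, b, poly=0x11b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def gf_inv(a):
    return gf_pow(a, 254)  # a^254 = a^-1 in GF(2^8) for a != 0

def bool_to_mult(b):
    """b: d+1 Boolean shares of x. Returns (p, b_last) such that
    x = (p_0 * ... * p_{d-1})^-1 * b_last."""
    b, p = list(b), []
    for k in range(len(b) - 1):
        pk = secrets.randbelow(255) + 1          # fresh nonzero mask p_k
        b[k:] = [gf_mul(pk, s) for s in b[k:]]   # expansion: multiply by p_k
        b[k + 1] ^= b[k]                         # compression: fold b_k away
        p.append(pk)
    return p, b[-1]

x = 0xAB
b0, b1 = secrets.randbelow(256), secrets.randbelow(256)
p, blast = bool_to_mult([b0, b1, x ^ b0 ^ b1])   # 3-share Boolean sharing of x
print(gf_mul(gf_inv(gf_mul(p[0], p[1])), blast) == x)  # True: sharing reconstructs x
```

Since multiplication distributes over XOR in GF(2^8), each expansion step preserves the invariant x = (p_0 ⊗ · · · ⊗ p_k)^-1 ⊗ (XOR of the remaining Boolean shares), which is exactly the relation stated above.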

Figure 4: Circuit for the shared Kronecker delta function δ(x) for n = 8

Figure 5: First-order adaptive masking implementation of the AES S-box. The dotted grey lines depict registers.

Figure 6: State and Key Array

[…], we do the adapted ShiftRows that puts each state byte one extra column to the right. The first MixColumns operation is in the next cycle (23), which means the first input byte to the Kronecker delta function (in State 30) is ready in cycle 0. During cycles 23 to 2, State 00 holds bytes of the last column and is thus fed back into State 33. The MixColumns operation occurs every four cycles, i.e. in cycles 23, 3, 7 and 11 (except in the last round of encryption).

Figure 8: Non-specific leakage detection test on 2.5 rounds of encryption of a first-order protected AES. Left: PRNG off; 12 000 traces. Right: PRNG on; 50 million traces. Rows (top to bottom): exemplary power trace, first-order and second-order t-values.

Figure 9: Non-specific leakage detection test on 2.5 rounds of encryption of a second-order protected AES. Left: PRNG off; 12 000 traces. Right: PRNG on; 50 million traces. Rows (top to bottom): exemplary power trace, first-order, second-order and third-order t-values.

Figure 10: Evolution of the maximum absolute t-value across the measurements. Left: first order. Right: second order.

Figure 11: Non-specific bivariate leakage detection test on 15 clock cycles of a first-order protected AES. Left: 1 million traces. Right: 45 million traces.

Figure 12: Non-specific bivariate leakage detection test on 15 clock cycles of a second-order protected AES with 50 million traces.

Figure 13: Conversion from Boolean to multiplicative masking with second-order leakage

To see this, consider the case V_1 = 0, which occurs with probability 1/|F_q|. Then b_0 ⊕ b_1 = 0, since r_0 ≠ 0 by construction. This implies that the second intermediate V_2 = b_2 = b_0 ⊕ b_1 ⊕ b_2 leaks the sensitive value x. As a result, the value E[L_1(V_1) · L_2(V_2) | X = x] depends on the secret input x for any device leakage behavior functions L_1, L_2, including the Hamming weight leakage behavior function. This can be verified with a short exhaustive script.

Figure 15: Circuit for the shared Kronecker delta function δ(x) for n = 8

However, this distribution does not change if we vary the secret.
Two Probes. In the second-order implementation, pairs of probes in the B stages also result in uniform distributions […]. In the A stages we see the distribution […] for most pairs. Since this is the outer product of […] with itself, such pairs are independent. A multivariate probe of a B-stage wire and a wire in the next A stage results in distributions […], except when we combine a cross product t_ij with share i or j of one of the multiplication inputs. In those cases, we see probability distribution […]. Again, these distributions are not uniform, but they are independent of the secret.
…Reparaz et al. (Consolidated Masking Schemes) [RBN + 15]. Reparaz et al. also discuss how ISW can be implemented to provide security on hardware. More recently, Gross et al. presented Domain Oriented Masking.

Table 1: State and key control during one round of encryption

Table 2: State and key inputs during one round of encryption (except during loading)

Table 3: Proof that the S-box in Figure 7 is d-SNI for d = 1, 2

Table 4: Implementation results for the AES S-box with the Nangate 45nm library

Table 7: Simulation of intermediate probes I and output probes O such that |I| + |O| ≤ d = 2, using |I| input shares, for the second-order multiplicative to Boolean conversion

Table 8: Simulation of intermediate probes I and output probes O such that |I| + |O| ≤ d = 2, using |I| input shares, for the second-order Boolean to multiplicative conversion