Gadget-based Masking of Streamlined NTRU Prime Decapsulation in Hardware

Streamlined NTRU Prime is a lattice-based Key Encapsulation Mechanism (KEM) that is, together with X25519, the default algorithm in OpenSSH 9. Based on lattice assumptions, it is assumed to be secure also against attackers with access to large-scale quantum computers. While Post-Quantum Cryptography (PQC) schemes have been subject to extensive research in recent years, challenges remain with respect to protection mechanisms against attackers that have additional side-channel information, such as the power consumption of a device processing secret data. As a countermeasure to such attacks, masking has been shown to be a promising and effective approach. For public-key schemes, including any recent PQC schemes, usually a mixture of Boolean and arithmetic techniques is applied on an algorithmic level. Our generic hardware implementation of Streamlined NTRU Prime decapsulation, however, follows an idea that until now was assumed to be efficiently applicable solely to symmetric cryptography: gadget-based masking. The hardware design is transformed into a secure implementation by replacing each gate with a composable secure gadget that operates on uniform random shares of secret values. In our work, we show the feasibility of applying this approach also to PQC schemes and present the first Public-Key Cryptography (PKC) implementation, pre- and post-quantum, masked with the gadget-based approach, considering several trade-offs and design choices. By the nature of gadget-based masking, the implementation can be instantiated at arbitrary masking order. We synthesize our implementation both for Artix-7 Field-Programmable Gate Arrays (FPGAs) and 45 nm Application-Specific Integrated Circuits (ASICs), yielding practically feasible results regarding the area, randomness requirement, and latency. We verify the side-channel security of our implementation using formal verification on the one hand, and practically using Test Vector Leakage Assessment (TVLA) on the other.
Finally, we also analyze the applicability of our concept to Kyber and Dilithium, which will be standardized by the National Institute of Standards and Technology (NIST).


Introduction
Wide deployment of Post-Quantum Cryptography (PQC) algorithms in devices and applications is indispensable, even though there is no guarantee that the advent of large-scale quantum computers will happen at all. For many application scenarios, security-critical implementations must also provide security against physical attacks.
• Compared to other existing fully masked PQC FPGA implementations, our implementation has similar (in the case of Saber) or significantly lower (in the case of Kyber) resource requirements for first-order security.
• We present the first arbitrary-order masked SHA-2 hardware implementation in literature.
• The side-channel resistance of our implementation is formally verified using VERICA [RFSG22] and practical measurements.

Preliminaries
In this section, we briefly introduce the notations used throughout this work. Afterward, we recap masking and important composability notions. Finally, we describe Streamlined NTRU Prime and particularly the decapsulation.

Notation
Throughout this work, we denote $R_q = \mathbb{Z}_q[x]/(x^p - x - 1)$ and $R_3 = \mathbb{Z}_3[x]/(x^p - x - 1)$ with $p, q$ being primes. Furthermore, we write $x[i:j]$ for bit vectors of length $|i - j| + 1$ and also allow multiple dimensions, e.g., $x[i:j, k:l]$ is a vector of $|i - j| + 1$ bit vectors, each of length $|k - l| + 1$. For masking, we use $d$ as the masking degree, i.e., the number of probes an attacker has access to. It follows that we split secrets into $d + 1$ shares, referring to a single share as $x^{(i)}$ with $0 \le i \le d$. Moreover, we denote a masked variable as $x^{(0:d)}$. At any occurrence of Boolean operations that involve masked variables, we assume that they are performed securely, e.g., by means of a secure gadget. Finally, we stress that negating a masked variable $x^{(0:d)}$ means inverting the secret value by inverting one share rather than inverting each share (which would not invert the secret value for odd $d$).

Masking
Masking is an approach based on Shamir's secret sharing. It has proven to be an effective countermeasure against power or EM side-channel attacks by splitting secret values into uniform random shares. In our work, we employ only Boolean masking, where a secret value $x$ is split into $d + 1$ shares $x^{(i)}$ such that $x = \bigoplus_{i=0}^{d} x^{(i)}$. While functions that are linear or affine in the masking domain can be applied trivially to each share individually, we use specialized methods to secure non-linear functions like AND or OR operations.
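As a purely functional illustration of Boolean masking, the following Python sketch splits a value into $d + 1$ shares, applies a linear operation share-wise, and negates a masked value by inverting a single share. The function names are hypothetical and chosen for this example only; the sketch models values, not hardware, and says nothing about glitches or probing security.

```python
import secrets
from functools import reduce
from operator import xor


def share(x, d, bits):
    """Split x into d + 1 Boolean shares whose XOR equals x."""
    shares = [secrets.randbits(bits) for _ in range(d)]
    shares.append(reduce(xor, shares, x))  # last share fixes the XOR sum
    return shares


def unshare(shares):
    """Recombine the shares (only done here to check results)."""
    return reduce(xor, shares, 0)


def masked_xor(a, b):
    """XOR is linear in the Boolean domain: apply it share by share."""
    return [ai ^ bi for ai, bi in zip(a, b)]


def masked_not(a, bits):
    """Negate the secret by inverting exactly one share."""
    return [a[0] ^ ((1 << bits) - 1)] + a[1:]
```

Inverting every share instead would flip the secret $d + 1$ times, which cancels out whenever $d$ is odd.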
The core concept of gadget-based masking is to replace individual hardware gates with secure versions, so-called gadgets. These secure hardware gates are designed to not leak their inputs and outputs via power consumption or EM emanation. Early versions of these secure gates aimed at ensuring that the power consumption remained constant, regardless of the input or output. An example of such an approach is Wave Dynamic Differential Logic (WDDL) [TV04], which is a type of Dual-Rail with Precharge (DRP) logic. These logic gates use differential inputs and outputs and have a pre-charge phase which ensures that transistors switch at every clock cycle, even if the inputs do not change. An overview of other DRP logic styles can be found in [DGBN09]. However, many of the DRP gates were successfully attacked over the years [SGD+09, MKEP11, PKZM07, DGBN09]. These attacks were possible due to effects such as unbalanced routing of the differential signals or glitches in the circuit.
In order to properly evaluate and formally verify the resistance of such special gates against side-channel attacks, a range of different attacker models have been proposed in the past. In 2003, Ishai, Sahai, and Wagner [ISW03] introduced the d-probing model, which is still frequently used as an appropriate abstraction. However, this model includes neither glitches nor transitions or couplings and thus has been extended to the robust d-probing model incorporating these phenomena [BGI+18, FGP+18].
Nevertheless, the robust d-probing model is insufficient to analyze the composability of gadgets (nowadays also called gate-level masking). Hence, Barthe et al. [...] [CGLS21]. HPC allows instantiating an arbitrary-order masked SecAND gadget with two clock cycles latency for one input and one clock cycle for the other input, denoted as HPC1. Moreover, they optimized this gadget for less randomness demand, denoted as the HPC2 gadget. Following this, Knichel et al. proposed Generic Hardware Private Circuits (GHPCs) to build more complex PINI gadgets [KSM22]. Finally, in a recent work, Knichel and Moradi presented HPC3, achieving lower latency by using more fresh randomness [KM22].
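To make the notion of a masked AND concrete, the sketch below models the classic multiplication of [ISW03] on single-bit shares. This is deliberately not HPC1 or HPC2: it has no registers, no pipeline stages, and no glitch robustness, and the function names are hypothetical. It only shows why a non-linear gate needs fresh randomness, here $d(d+1)/2$ bits per AND.

```python
import secrets


def share_bit(x, d):
    """Split one bit into d + 1 Boolean shares."""
    shares = [secrets.randbits(1) for _ in range(d)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]


def unshare_bit(shares):
    out = 0
    for s in shares:
        out ^= s
    return out


def isw_and(a, b):
    """ISW-style AND of two shared bits (functional model only)."""
    n = len(a)
    c = [a[i] & b[i] for i in range(n)]  # share-local products
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(1)  # one fresh random bit per share pair
            c[i] ^= r
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c
```

The cross products $a^{(i)} b^{(j)}$ are each blinded by a fresh random bit before they are folded into an output share, which is the part that hardware gadgets additionally protect with registers.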

Streamlined NTRU Prime
Streamlined NTRU Prime is a lattice-based Key Encapsulation Mechanism (KEM) that is resistant against both classical and quantum adversaries [BCLv17, BBC+20]. It has been designed carefully using structured lattices while firmly avoiding potentially exploitable attack surfaces. In particular, it eliminates decryption failures and employs large Galois groups instead of cyclotomics.
Streamlined NTRU Prime defines Short as the set of polynomials in $R_q$ with exactly $w$ non-zero coefficients from $\{-1, 1\}$. Furthermore, we also use the notation of an underline indicating that the respective value is encoded.
As a KEM, it uses the Fujisaki-Okamoto transform to achieve indistinguishability under chosen-ciphertext attacks (IND-CCA) and builds upon a public-key encryption scheme that fulfills one-wayness against passive attacks. In the following, we describe the three procedures of the KEM: key generation, encapsulation, and decapsulation.
Key Generation. First, a uniform random polynomial $g$ in $R_3$ is generated. This step is repeated until $g$ is invertible in $R_3$. Then, the inverse polynomial of $g$ is computed. Furthermore, $f$ is sampled to be a polynomial from Short. The secret key consists of $f$ and $g^{-1}$ as well as a random bit string $\rho$, which is used for implicit rejection during decapsulation. Finally, the public key is computed as $h = g/(3f)$ where $h \in R_q$.
Encapsulation. The first step is to sample a uniformly random polynomial $r$ from Short, which is then multiplied with the public key polynomial $h$. In the resulting polynomial, each coefficient is rounded to the nearest multiple of three. The output of this operation is denoted as the polynomial $c$. Subsequently, the encoded $r$ and the encoded public key are hashed to create the ciphertext confirmation hash. The confirmation hash together with the encoded $c$ is the ciphertext. The session key is computed by hashing the encoded $r$ and the ciphertext.
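The rounding step can be illustrated on a single coefficient. The following sketch assumes the sntrup761 modulus $q = 4591$ and a centered coefficient representation; the helper names are hypothetical, and the code is an unmasked software model rather than the hardware rounding module.

```python
Q = 4591  # sntrup761 modulus


def center(x):
    """Map x to its centered representative in [-(q-1)/2, (q-1)/2]."""
    x %= Q
    return x - Q if x > Q // 2 else x


def round3(x):
    """Round a coefficient to the nearest multiple of three."""
    x = center(x)
    r = ((x + 1) % 3) - 1  # centered remainder in {-1, 0, 1}
    return x - r
```

Because $(q - 1)/2 = 2295$ is itself a multiple of three, rounding never leaves the centered interval.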
Algorithm 1 (excerpt; only the following part of the listing is recoverable)
    [...] (the first $w$ elements are 1, the rest 0)
    8: $c' \in R_q := \mathrm{Round}(h r')$ (re-encrypt with $h$, $r'$; compute new ciphertext $c'$)
    9: $\underline{c'} := \mathrm{Encode}(c')$
    10: $\underline{r'} := \mathrm{Encode}(r')$
    11: $\gamma' := \mathrm{hash}_2(\mathrm{hash}_3(\underline{r'}), \mathrm{hash}_4(K))$ (re-compute the ciphertext confirmation hash)
    [...] return $\mathrm{hash}_1(\mathrm{hash}_3(\underline{r'}), C)$
    15: else
    16: [...]
Decapsulation. The decapsulation is shown in detail in Algorithm 1. The basic idea is to remove the denominator of the public key from the ciphertext by multiplying by $3f$ in $R_q$. The subsequent application of modulo 3 to each coefficient removes the rounding error. This is followed by the multiplication with $1/g \in R_3$ to also remove the numerator of the public key and to obtain the plaintext. This plaintext is checked to be in the correct space Short. Furthermore, to ensure that no chosen-ciphertext attack is carried out, the obtained plaintext is re-encrypted and the result is compared to the original ciphertext. If everything matches, the correct session key is reconstructed; otherwise, an implicit rejection is performed by using $\rho$. Note that this final rejection step is strictly required to be performed in constant time.

Conceptual Considerations
To implement the decapsulation as shown in Algorithm 1, we essentially need six major modules:
1. Polynomial multiplication with operands in $(R_q, R_3)$ and return values in $R_q$,
2. Polynomial multiplication with operands in $(R_3, R_3)$ and result in $R_3$, [...]
Standard Approach. Usually, to mask polynomial multiplication modules, additive masking would be applied, with either multiple polynomial multipliers being instantiated in parallel, or one polynomial multiplier being instantiated that processes the shares consecutively. Moreover, two of the three multiplications have one public and one secret input, which can be realized very efficiently by applying additive masking as it only requires $d + 1$ polynomial multiplications and no re-sharing. The other multiplication, however, has two secret input polynomials. In order to perform a secure polynomial multiplication in the additive domain, $\frac{d^2 + d}{2}$ fresh random polynomials need to be sampled. Additionally, $2(d^2 + d)$ polynomial additions and $d^2 + d$ polynomial multiplications must be performed.
In contrast, masking the reduction, weight check, and rounding is non-trivial in the arithmetic domain and would be solved in the Boolean domain. Finally, SHA-512 uses 64-bit additions, which are efficient in the additive domain and feasible but less efficient in the Boolean domain, as well as non-linear Boolean operations that strictly require Boolean masking.
In summary, this traditional approach is expected to yield a relatively efficient implementation at the cost of converting between additive and Boolean masking domain multiple times.Moreover, this type of implementation is often very specific in terms of masking degree, i.e., not being parametrizable.Besides, the wide variety of applied techniques produces a larger attack surface, as shown in recent attacks on masking conversions [NWDP22].

Applicability of Gadget-based Masking.
To overcome these downsides, we follow a recent line of research from the field of masking symmetric cryptographic schemes: gadget-based masking. For schemes in symmetric cryptography, we usually find a Boolean description that enables masking them at the gate level. This differs for public-key and post-quantum cryptography as these schemes typically employ arithmetic operations on number-theoretic structures such as multiplications in polynomial fields. Polynomial multiplications, however, consist of modular multiplications and additions in some finite number field. While the modular additions can be masked easily in the Boolean domain through a secure adder, the modular multiplications are vastly more complex and are deemed infeasible to be masked in the Boolean domain.
However, for Streamlined NTRU Prime, we observe that the three polynomial multiplications each have at least one factor in $R_3$. Consequently, if we employ schoolbook multiplication, the underlying coefficient multiplication-accumulation has an input from $\mathbb{Z}_q$ being multiplied by either 1, 0, or −1 and then accumulated to another value in $\mathbb{Z}_q$. We immediately observe that no complex modular multiplication must be carried out in this case. Instead, we can securely multiplex between the input coefficient from $\mathbb{Z}_q$, its precomputed additive inverse, and zero. The result is added securely to the accumulation value. As indicated before, all other operations are already feasible in the Boolean domain, enabling the first fully Boolean masked implementation of a public-key and post-quantum secure scheme.
In the following, we describe our design considerations for each module in the Boolean domain. Note that in contrast to conventional hardware development, where it is desirable to have as many NAND gates as possible as they are the smallest gates, the design goal in our case is to have as few SecAND gadgets as possible, as they require fresh randomness. Throughout our design, we use the HPC2 SecAND gadget [CGLS21].

Polynomial Multiplication
Polynomial multiplications are the most expensive operations in decapsulation. Thus, research usually focuses on improving their performance [Mar20, PMT+22, CHK+21, ACC+21]. Instead, we focus on achieving a secure implementation. During decapsulation, two types of multiplications are required: 1. Multiplication in $R_q$ with one operand from $R_3$ (Lines 4 and 8 in Algorithm 1) and 2. Multiplication in $R_3$ (Line 5 in Algorithm 1).

Multiplication in R q
We observe that if we employ a standard schoolbook multiplication approach for both occasions of this multiplication, no coefficient multiplier is necessary. Instead, we use a secure adder and a secure three-way multiplexer. It is important to note that for both multiplications in $R_q$, the input polynomial from $R_q$ is public, while the other factor from $R_3$ is secret. Thus, the idea is to compute the additive inverse of the input coefficient from $R_q$, which is unmasked. Then, we securely multiplex with the masked select signal between both values and zero, and finally accumulate the result securely to the (intermediate) result coefficient. The architecture is shown in Figure 1.
Secure Multiplexing. Furthermore, we need a secure three-way multiplexer. The three public input signals are $z = 0$, $a_p = a$, $a_n = q - a \in \mathbb{Z}_q$. However, here we view them as Boolean values in $\mathbb{F}_2^{13}$. The secret select signal is $(f[1], f[0]) \in \{(0, 0), (0, 1), (1, 1)\}$. We perform two consecutive secure 2-input multiplexing operations (Equations 1 and 2). Note that the public inputs can be set as first shares and all other shares are just zeros. This is the reason why we can simply omit $z$ in Equation 2. The SecAND gadget generates a uniformly random output also for the case that $(f[1], f[0]) = (0, 0)$.
Secure Addition. Parallel prefix adders can achieve efficient addition in hardware. These concepts were first adapted to the Boolean masked domain in [SMG15]. This was followed by a broader examination of more recent techniques like threshold implementation and gadget-based masking [BG22], which we deploy for our work.
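The multiplex-and-accumulate idea of this section can be modeled functionally as follows. The sketch is unmasked and assumes the select encoding $(0,0) \mapsto 0$, $(0,1) \mapsto 1$, $(1,1) \mapsto -1$; the two Boolean multiplexer expressions are one plausible way to write the two 2-input multiplexing steps and are not claimed to match Equations 1 and 2 term by term. All names are hypothetical.

```python
Q = 4591
MASK13 = (1 << 13) - 1
ENCODE = {0: (0, 0), 1: (0, 1), -1: (1, 1)}  # coefficient -> (f[1], f[0])


def mux3(a, f1, f0):
    """Select between 0, a, and q - a using only XOR and AND."""
    a_p = a
    a_n = (Q - a) % Q  # assumption: reduce so that a = 0 stays in [0, q)
    # First 2-input mux: f[1] chooses between a_p and a_n.
    t = a_p ^ (-f1 & MASK13 & (a_p ^ a_n))
    # Second 2-input mux against z = 0: the zero input can be omitted.
    return -f0 & MASK13 & t


def mac(acc, a, f1, f0):
    """Accumulate the selected value; % Q stands in for adder plus CSubQ."""
    return (acc + mux3(a, f1, f0)) % Q
```

In the masked design, each AND with a select bit would be a SecAND gadget, while the XORs are applied share-wise.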

Multiplication in R 3
For multiplications in $R_3$, only nine possible input combinations with three output combinations exist. Thus, we develop a direct Boolean masking utilizing the fact that the single inputs have a limited range. Multiplying two signed two-bit coefficients e[1 [...]. Then, we add $r[1:0]^{(0:d)}$ to the accumulation value $a[1:0]^{(0:d)}$ and map the result back to the signed $a[1:0]^{(0:d)} \in \{-1, 0, 1\}$, which can be done with formulas that take into account that only $00_2$, $01_2$, $11_2$ are valid inputs.
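A small unmasked model shows why the coefficient multiplication in $\mathbb{Z}_3$ needs only a couple of AND gates under the signed encoding $00_2 = 0$, $01_2 = 1$, $11_2 = -1$. The Boolean expressions below are one straightforward choice derived from that encoding; they are an assumption of this sketch and may differ from the formulas used in the design.

```python
ENCODE = {0: (0, 0), 1: (0, 1), -1: (1, 1)}  # value -> (bit 1, bit 0)
DECODE = {bits: value for value, bits in ENCODE.items()}


def mul3(e, f):
    """Multiply two signed two-bit coefficients from {-1, 0, 1}."""
    e1, e0 = e
    f1, f0 = f
    r0 = e0 & f0         # product is non-zero iff both inputs are non-zero
    r1 = (e1 ^ f1) & r0  # sign bit: signs differ and product is non-zero
    return (r1, r0)
```

Each `&` would become one SecAND gadget in a masked circuit, while the `^` is linear and free of randomness.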

Schoolbook Polynomial Multiplication
Generally, there are three approaches for this: Either we rotate one of the input polynomials or the output polynomial. For our two "big" multiplications in $R_q$, we have a small secret input represented by $2(d + 1)$ bits, a big public input represented by $\lceil \log_2 q \rceil$ bits, and a big secret output represented by $(d + 1) \lceil \log_2 q \rceil$ bits. Since shifting large amounts of data is expensive in terms of routing, Flip-Flop (FF) demand, and dynamic power consumption, the natural choice is to rotate either of the input polynomials.

Polynomial Reduction modulo x p − x − 1
For the schoolbook multiplication, we can directly perform the polynomial reduction. We observe that $x^p \equiv x + 1 \pmod{x^p - x - 1}$, which indicates that the uppermost coefficient (of $x^p$) during rotation must additionally be added to the previously lowermost coefficient. As we indicated before, we want to rotate either of the input polynomials. Applying this strategy to the $R_3$ polynomial would increase the coefficient range to $[-2, 2]$ due to the extra addition during polynomial reduction. We would require a 5-way multiplexer instead of a 3-way multiplexer, increasing both area and randomness demand. Thus, we choose to rotate the public $R_q$ input polynomial and perform the polynomial reduction in the same domain.
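The rotate-and-reduce strategy can be checked with a short unmasked model. The sketch below rotates the public polynomial (multiplication by $x$ with $x^p \equiv x + 1$) and accumulates it under control of the small coefficients; it is compared against a plain multiply-then-reduce reference. Function names and the toy parameter $p = 7$ used in the check are assumptions for illustration.

```python
def rotate_x(a, q):
    """Multiply a(x) by x modulo x^p - x - 1 (coefficients low to high)."""
    top = a[-1]                   # coefficient that moves to x^p
    out = [top] + a[:-1]          # x^p contributes 1 ...
    out[1] = (out[1] + top) % q   # ... and x, i.e. the old lowest slot
    return out


def schoolbook(a, f, q):
    """a public in Z_q, f with coefficients in {-1, 0, 1}."""
    acc = [0] * len(a)
    rot = list(a)
    for fi in f:
        acc = [(c + fi * r) % q for c, r in zip(acc, rot)]
        rot = rotate_x(rot, q)
    return acc


def reference(a, f, q):
    """Full product followed by reduction modulo x^p - x - 1."""
    p = len(a)
    prod = [0] * (2 * p - 1)
    for i, ai in enumerate(a):
        for j, fj in enumerate(f):
            prod[i + j] += ai * fj
    for k in range(2 * p - 2, p - 1, -1):
        prod[k - p + 1] += prod[k]
        prod[k - p] += prod[k]
    return [c % q for c in prod[:p]]
```

Because only the public polynomial is rotated, the secret operand keeps its three-valued range and a 3-way multiplexer suffices.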

Modular Reductions
For Streamlined NTRU Prime decapsulation, we require two different modular reductions.
Reduction Modulo q. This reduction is only applied for the accumulation within the $R_q$ polynomial multiplications. We decided to use only the non-negative modular representation in the interval $[0, q)$, since we would need to check both for underflows and overflows in the centered representation. Therefore, the value to reduce only grows by a maximum of one bit and can only provoke an overflow. Thus, a conditional subtraction by $q$ suffices, which we perform as follows.
We subtract $q$ from all accumulation results and obtain the carry bit from that subtraction. If this is 1, we know an underflow occurred. Thus, we can use the carry bit to securely multiplex between the original accumulation value and the subtracted value. This keeps all intermediate values in the minimal interval $[0, q)$.
Reduction Modulo 3. For the modulo 3 reduction, we are given an input from $\mathbb{Z}_q$ and want to reduce it to $\{-1, 0, 1\}$. We start with an unsigned 13-bit number $z[12:0]$ and repeatedly exploit the relation $2 \equiv -1 \pmod 3$. Note that all operations are carried out in the masked domain, but we omit the masking notation when dealing with arithmetic modulo 3.
The result of this computation ranges from −6 to 7 and is represented by a signed 4-bit integer $y[3:0] = -2^3 y[3] + y[2:0]$. We again exploit the above relation, which results in a value ranging from −1 to 3, represented by a signed 3-bit integer. This value can already be mapped to a value $w[1:0] \in \{-1, 0, 1\}$ efficiently. One additional point to consider is that this modulo 3 calculation assumes an unsigned 13-bit number. However, in the NTRU Prime specification, the modulo 3 operation is used on signed 13-bit numbers in the interval $[-q/2, q/2]$ [BBC+20]. This means that numbers in the interval $[q/2, q)$ must be treated slightly differently, as these were originally negative. However, the solution is simple: such a number $x$ represents $x - q$, and since $q = 4591 \equiv 1 \pmod 3$, we simply have to subtract 1 (modulo 3) from the final result if the original number was in the interval $[q/2, q)$. This correction can be performed in a similar way to the addition in the multiplication in $R_3$ (see Section 3.1.2).
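The staged reduction modulo 3 can be reproduced in a few lines of unmasked Python. The first stage uses $2^i \equiv (-1)^i \pmod 3$, i.e., it adds the even-indexed bits and subtracts the odd-indexed ones, which gives exactly the stated range from −6 to 7. The second stage uses $-8 \equiv 1$, $4 \equiv 1$, $2 \equiv -1 \pmod 3$ on the signed 4-bit value. The concrete expressions and the final lookup are reconstructed from these relations and are an assumption of this sketch; names are hypothetical.

```python
Q = 4591


def mod3_unsigned(z):
    """Reduce an unsigned 13-bit z to a representative in {-1, 0, 1}."""
    bits = [(z >> i) & 1 for i in range(13)]
    y = sum(bits[0::2]) - sum(bits[1::2])   # stage 1: value in [-6, 7]
    yb = [((y & 0xF) >> i) & 1 for i in range(4)]  # signed 4-bit encoding
    v = yb[3] + yb[2] - yb[1] + yb[0]       # stage 2: value in [-1, 3]
    return {-1: -1, 0: 0, 1: 1, 2: -1, 3: 0}[v]


def mod3_centered(z):
    """Same, but z in (q/2, q) stands for the negative number z - q."""
    w = mod3_unsigned(z)
    if z > Q // 2:
        w = (w % 3) - 1   # subtract 1 modulo 3, result again in {-1, 0, 1}
    return w
```

The exhaustive check over all 13-bit inputs is cheap and confirms both the intermediate ranges and the sign correction.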
Unsigned. We compute $r[0, :]^{(0:d)} \lor r[1, :]^{(0:d)}$ and accumulate the resulting shared bit vector with a $\lceil \log_2 w \rceil$-bit adder. It follows that the signed representation demands fewer non-linear Boolean operations. For the secure adder, the same adder as used for the polynomial multiplications is applied.
Following this, we bit-wise XOR the shared adder output with the public target weight $w$, and then OR all bits of the result together into a single shared result bit.
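The weight check can be sketched in unmasked form as follows. The model assumes the signed encoding $00_2 = 0$, $01_2 = 1$, $11_2 = -1$, where bit 0 alone marks a non-zero coefficient, and the unsigned alternative, where both bits must be ORed. The function names and the toy inputs in the check are hypothetical.

```python
def nonzero_bits_signed(coeffs):
    """Signed encoding (00, 01, 11): bit 0 alone flags a non-zero value."""
    return [b0 for (_b1, b0) in coeffs]


def nonzero_bits_unsigned(coeffs):
    """Unsigned encoding (00, 01, 10): one OR per coefficient is needed."""
    return [b1 | b0 for (b1, b0) in coeffs]


def weight_mismatch(flags, w):
    """Return 1 if the number of non-zero coefficients differs from w."""
    weight = sum(flags)   # stands in for the secure adder
    diff = weight ^ w     # XOR with the public target weight
    bit = 0
    while diff:           # OR all bits of the difference together
        bit |= diff & 1
        diff >>= 1
    return bit
```

Since OR is non-linear, the signed encoding saves one secure gadget per coefficient in the masked circuit.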

SHA-512
SHA-512 employs a Merkle-Damgård construction processing a 512-bit state divided into eight 64-bit words A, B, C, D, E, F, G, H. In order to update the state, SHA-512 implements seven adders (modulo $2^{64}$), the two functions $\Sigma_0$ and $\Sigma_1$, and the functions SHA-Ch and SHA-Ma. The former two functions $\Sigma_0$ and $\Sigma_1$ consist of simple shift operations by three different values for each function, processing A and E, respectively. The outputs of the shifts are added together by XOR operations. SHA-Ch and SHA-Ma are non-linear functions processing E, F, G and A, B, C, respectively.
For our masked hardware implementation, we protect the seven adders by applying the concept of the masked adder introduced in Section 3.1. We instantiate a complete 64-bit adder to realize the correct addition. Masking $\Sigma_0$ and $\Sigma_1$ can be accomplished in a straightforward way since the shift operations do not introduce additional implementation overhead in hardware and all XOR gates can simply be replaced by secure XOR gadgets.
Finally, SHA-Ch and SHA-Ma are bit-wise operations that can be implemented in parallel to match the width of the adder to be used. Hence, we can modify the formulas for both to reduce the number of non-linear gates in order to minimize the amount of required randomness and the area overhead, leading to Equations 11 and 12.
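To illustrate how such rewriting reduces the number of non-linear gates, the sketch below compares the textbook definitions of the two SHA-2 functions with well-known equivalent forms that use a single AND each. These particular rewritings are an assumption of this example and are not claimed to be the exact formulas used in the design.

```python
MASK64 = (1 << 64) - 1


def ch_ref(e, f, g):
    """Textbook choice function: two ANDs."""
    return ((e & f) ^ (~e & g)) & MASK64


def ch_one_and(e, f, g):
    """Equivalent form with a single AND."""
    return g ^ (e & (f ^ g))


def ma_ref(a, b, c):
    """Textbook majority function: three ANDs."""
    return (a & b) ^ (a & c) ^ (b & c)


def ma_one_and(a, b, c):
    """Equivalent form with a single AND."""
    return b ^ ((a ^ b) & (b ^ c))
```

Every AND saved removes one SecAND gadget per bit position, and with it the corresponding fresh randomness.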

Encoding, Decoding & Comparison
Streamlined NTRU Prime defines multiple encoding and decoding algorithms for transforming polynomials in $R_3$ and $R_q$ to and from byte arrays [BBC+20]. Decoding the ciphertext and public key can be done unmasked as they are both public. We use the decoder described in [PMT+22]. For decoding the secret polynomials $f$ and $g^{-1}$, we also use the decoder from [PMT+22], and apply masking afterwards. However, we need to securely encode $r$ into a byte array to compute the confirmation hash and session key. For this, we apply masking to the $R_3$ encoder from [PMT+22]. This is straightforward as the encoder only consists of a shift register and a 2-bit adder.
In the original algorithm specification, the recomputed ciphertext polynomial $c'$ is encoded (line 9 in Algorithm 1) before the ciphertext comparison (line 13), using an $R_q$ encoder. However, the $R_q$ encoder requires a 16-bit multiplication, which would be prohibitively expensive to implement securely. We instead compare the ciphertext polynomial coefficients directly, after which we compare the confirmation hashes. This spares us from implementing the masked $R_q$ encoder. The masked ciphertext comparison is straightforward: We do a bit-wise secure XOR of the two ciphertext coefficients and then repeatedly OR the output together.
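The comparison and the subsequent choice between the encoded plaintext and $\rho$ can be modeled without any data-dependent branch. The following unmasked sketch uses hypothetical names and toy inputs; in the masked design the OR and the multiplexer would be built from secure gadgets.

```python
def or_reduce(v):
    """OR all bits of v together into a single bit."""
    bit = 0
    while v:
        bit |= v & 1
        v >>= 1
    return bit


def ciphertexts_differ(c, c_prime):
    """XOR coefficient pairs and OR everything into one result bit."""
    acc = 0
    for x, y in zip(c, c_prime):
        acc |= x ^ y
    return or_reduce(acc)


def select(fail, r_encoded, rho):
    """Branch-free mux: keep r_encoded if fail == 0, else take rho."""
    mask = -fail & 0xFF
    return bytes(a ^ (mask & (a ^ b)) for a, b in zip(r_encoded, rho))
```

The result bit drives the implicit rejection, so the same sequence of operations is executed whether or not the ciphertext is valid.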

Implementation
After introducing the theoretical background of masking all required operations, we now discuss the implementations of each building block.
Add13 and Add64. In their work [BG22], Bache and Güneysu compare the Brent-Kung, Kogge-Stone, and Sklansky adder architectures in the context of Boolean masking. For gadget-based masking, the Sklansky adder turns out to be the optimal choice, having the same low latency as Kogge-Stone but less randomness demand while having a lower latency than Brent-Kung at the cost of slightly more randomness.
The 13-bit Sklansky adder with carry-out deployed in our implementation is shown in Figure 2a. For input bits $a[i]^{(0:d)}$, $b[i]^{(0:d)}$ where $i \in \{0, \ldots, 12\}$, we compute the values $p$ and $g$ in each circle (Equations 13 and 14). Note that the dotted circle indicates that the input is all zero and requires no computation. It is needed to compute the carry-out, as we have a 14-bit output, and ensures that the lower layers operate on the correct inputs. Each square node has four inputs, the two "left" inputs g [...], and computes the following outputs: [...] Finally, note that all leaf nodes do not need to compute $p^{(0:d)}$, as only the final $g^{(0:d)}$ values are needed.
The 64-bit adder works equivalently, though with a total of six levels.In this case, we do not need a carry-in or carry-out.
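For orientation, a generic Sklansky parallel prefix adder can be written as the following unmasked model. It uses the standard propagate and generate definitions $p = a \oplus b$, $g = a \wedge b$ and the usual divide-and-conquer wiring; whether this matches Figure 2a node by node is an assumption, and the function name is hypothetical. Generate signals are combined with XOR instead of OR, which is valid because the two terms can never be 1 at the same time.

```python
def sklansky_add(a, b, n):
    """Add two n-bit values; the result includes the carry-out as bit n."""
    p0 = [((a ^ b) >> i) & 1 for i in range(n)]  # propagate per bit
    g = [((a & b) >> i) & 1 for i in range(n)]   # generate per bit
    p = list(p0)
    level = 0
    while (1 << level) < n:
        for i in range(n):
            if (i >> level) & 1:
                # Highest bit of the block directly below i's block.
                j = ((i >> level) << level) - 1
                g[i] ^= p[i] & g[j]
                p[i] &= p[j]
        level += 1
    carries = [0] + g[:-1]  # carry into bit i is the prefix generate of i-1
    total = 0
    for i in range(n):
        total |= (p0[i] ^ carries[i]) << i
    return total | (g[n - 1] << n)
```

With `n = 13` the loop runs four levels, and with `n = 64` it runs six, in line with the level count stated for the 64-bit adder.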
CSubQ. For the conditional subtraction with $q$, we take a similar approach. We instantiate another Sklansky adder with one public operand fixed to the two's complement of $q$. Then, after each addition (let us denote the result here as $x^{(0:d)}$), we perform this subtraction by $q$ and obtain $(x - q)^{(0:d)}$ as well as the shared carry-out bit. Using this, we multiplex securely between $x^{(0:d)}$ and $(x - q)^{(0:d)}$, selecting the former if the carry-out is one (indicating an underflow has occurred) and else the latter one.
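A functional, unmasked model of the conditional subtraction is given below. It assumes a 14-bit datapath (accumulation results lie in $[0, 2q)$) and derives the underflow flag as the inverted carry-out of the addition with the two's complement of $q$; the bit width and the polarity of the flag are assumptions of this sketch, and the names are hypothetical.

```python
Q = 4591
WIDTH = 14                    # assumed width for values in [0, 2q)
MASK = (1 << WIDTH) - 1
NEG_Q = (1 << WIDTH) - Q      # two's complement of q on WIDTH bits


def csubq(x):
    """Return x - q if x >= q, else x, without a data-dependent branch."""
    t = x + NEG_Q                    # adder with one fixed public operand
    diff = t & MASK                  # x - q modulo 2^WIDTH
    carry_out = (t >> WIDTH) & 1     # 1 iff x >= q in this convention
    underflow = carry_out ^ 1        # 1 iff the subtraction went negative
    select = -underflow & MASK
    return diff ^ (select & (diff ^ x))  # mux: x on underflow, else diff
```

The final line is the Boolean 2-input multiplexer; in the masked circuit the AND with the select signal is the only non-linear step.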
The fixed input already enables vast optimizations by the synthesizer.Further improvements could be made by optimizing the adder architecture itself for a fixed operand.Since we know the positions of the zeros, we could simplify our adder as depicted in Figure 2b.
However, note that we did not implement these optimizations and have left them for future work.
The computation of all $p$ values below the first row is the same as before. However, we can completely omit computing the first row of $p$, $g$ as described in Equation 13 and Equation 14. Instead, given an input $a[12:0]^{(0:d)}$, we know for each circle in Figure 2b that [...]. In Figure 2b, the circles filled with the diagonal line pattern indicate that the fixed input bit of the two's complement of $q$ is one. For the squares, we have four different cases now: [...]

Non-filled. Computed as before.
Mod3 and Mul3. The architecture to compute Mod3 is depicted in Figure 3. For the secure additions and subtractions, we employ simple ripple-carry adders as parallel prefix adders have no advantage for these small bit widths. The $\mathbb{Z}_3$ multiplier of the Mul3 module can also be directly implemented according to Equations 3 through 6 with the HPC2-SecAND gadget. The Mul3 module is fully pipelined, with a latency of five clock cycles.

Mux3 and Mux2.
Mux3 can be implemented with three pipeline stages as the HPC2-SecAND gadget has a delay of two cycles for one input and one clock cycle for the other.We instantiate 13 of these two-bit MUXes in parallel in order to feed Add13 without idling.
We have a delay of two cycles for Mux2, which has two secret data inputs and a secret select input. We instantiate 13 MUXes in the $R_q$ multiplier to select between the CSubQ output and the non-subtracted value. We also instantiate two multiplexers during the weight check calculation to select between the original $r$ and the fixed vector. Finally, we use eight multiplexers to select between the encoded $r$ and $\rho$ after the ciphertext comparison.
SHA-Ch and SHA-Ma. Both SHA-Ch and SHA-Ma can be directly implemented according to Equations 11 and 12, respectively, with the HPC2-SecAND gadget. We implement both operations with a width of 64 bits in order to be able to directly feed the output to the Add64 module. SHA-Ch has a latency of two clock cycles, while SHA-Ma has a latency of three clock cycles.

Evaluation
After introducing our implementation concept, we present the corresponding implementation results in this section. Furthermore, we formally verify and perform practical measurements of our building blocks in order to demonstrate their protection against side-channel attacks. Finally, we compare our hardware implementation of Streamlined NTRU Prime to a hardware design of Saber.

Implementation Results
We implement our design on a Xilinx Artix-7 device, using Vivado v2021.2 (64-bit), for the sntrup761 parameter set. We also synthesize our design for an ASIC using the 45 nm Nangate open cell library. Table 1 shows the latency, frequency, and peak randomness demand per module and masking degree. As shown, the cycle count is dominated by the three polynomial multiplications, which take 93 % of all total cycles. At the same time, the peak randomness is always set by the 64-bit adder in the SHA-512 module. While the total cycle count is independent of the masking order, the maximum clock frequency varies: On an FPGA and at masking orders 1 and 3, the design reaches 200 MHz, but the maximum frequency is lower for masking orders 2, 4, and 5. For all three, the critical path lies in the SHA-512 module. For the ASIC, the design reaches a higher maximum clock frequency than the FPGA at first order, with 207 MHz. However, as the masking order increases, the maximum frequency drops off faster, reaching just 75 MHz at fifth order and 100 MHz at sixth order. Here, the critical path also lies in the SHA-512 module.
In Table 2, the footprint per module and masking degree is shown for the Artix-7 FPGA. As expected, the area increases vastly with increasing masking degrees. Interestingly, for all masking orders, the SHA-512 dominates the resource cost, consuming roughly 61 % of all LUTs and FFs. The next most expensive operation is the rounding during the re-encryption, followed by the $R_q$ polynomial multiplication. When comparing the ratios of cycle counts and the resources consumed, it is apparent that the current SHA-512 implementation is sub-optimal: it is too expensive when considering the whole design. In particular, the 64-bit adder is oversized. For a better ratio of cycles and resources consumed, using a smaller adder, e.g., a 16-bit adder, multiple times for each 64-bit addition would be more efficient while adding only a comparatively minor number of cycles. Doing so would also allow the SHA-Ch and SHA-Ma gadgets to have smaller widths, saving further resources. Finally, this would reduce the maximum number of random bits used per cycle.
Table 1. Latency, frequency, and randomness results after Place and Route (PnR). Note that the cycle count for SHA-512 is for a single 1024-bit block. We did not perform PnR for orders 6 and 7 for an FPGA, as they no longer fit into an Artix-7 FPGA.

[Table 1 columns: Module; Cycle Count; Maximum Randomness (bits per cycle)]

In the right part of Table 3, we list the gate equivalent area demand per module and masking degree for an ASIC. As we did not have access to a memory macro, we listed the memory footprint separately. We see similar behavior to the FPGA resource requirements, with the SHA-512 dominating the area footprint, followed by the rounding during the re-encryption. The total GE also grows significantly as the masking order increases, while the SRAM usage grows more slowly.
Different Masking Degrees for Decrypt and Re-Encrypt. In [ABH+22], the authors reason that re-encryption must be protected at a higher level than decryption during decapsulation. Our design and all building blocks can be easily adapted to any masking order, allowing a flexible configuration. However, doing so would decrease the number of modules that can be reused across the design, e.g., the $R_q$ multiplier, which is used both during decryption and re-encryption.

Side-Channel Evaluation
In order to evaluate the protection against side-channel attacks, we rely on formal verification of each of our submodules and additionally perform practical side-channel measurements based on Test Vector Leakage Assessment (TVLA). Evaluating the entire decapsulation by formal verification or practical measurements is infeasible for typical side-channel setups due to the huge number of required clock cycles.
Formal Verification. We formally verify the security of each module by using the recently presented verification tool VERICA [RFSG22]. VERICA is constructed based on the verification concepts developed in the side-channel analysis tool SILVER [KSM20] and the fault-injection analysis tool FIVER [RBSS+21]. The formal verification of a target design is performed based on its (Verilog) gate-level netlist, which is transformed into a Directed Acyclic Graph (DAG) serving as circuit model. Each node in the DAG is associated with a Binary Decision Diagram (BDD) representing the Boolean function of the corresponding gate. This data structure allows efficient applications of statistical checks verifying side-channel security in the glitch-extended probing model and composability notions. To this end, we analyze our modules in the glitch-extended d-probing model for different security orders. The security order $d$ was configured according to the security order of the design under test. For the evaluation, we use a machine equipped with an Intel Xeon CPU (E5-1660) running at 3.20 GHz and 128 GB of RAM. VERICA is allowed to use up to 16 cores and 8 GB of RAM per core. The corresponding results are shown in Table 4. Note that all modules pass first- and second-order verification, while third-order verification is too complex for Mod3 and Mul3. For the Add13 and Add64 modules, we use the implementation by Bache and Güneysu [BG22], which has been analyzed by practical measurements.

Measurement Results.
As additional security analysis, we performed side-channel measurements of our first-order protected designs on a Sakura-G FPGA evaluation board, which is equipped with a Xilinx Spartan 6 FPGA. The target FPGA was supplied with a 4 MHz clock while the power consumption was measured via the voltage drop over a 1 Ω shunt resistor. The power traces were acquired by using a ZFL-2000GH+ Low Noise Amplifier (LNA) connected to a Spectrum M4 oscilloscope (8 bit resolution). The oscilloscope collected the data with a sample rate of 1.5 GS/s. To generate the required online randomness, we instantiated a Keccak core used as a Pseudorandom Number Generator (PRNG).
The measurement results for 10 million power traces can be found in Figure 4, Figure 5, and Figure 6. For all experiments, we first plot a sample trace to document a proper setup of the measurement equipment. In the subsequent plots, we used Welch's t-test to detect potential leakage. In general and for a low number of sample points, a threshold of ±4.5 is used to decide whether the design leaks information via the power consumption [SM15]. Accordingly, we do not observe any notable leakage in the first order but, as expected, some leakage in the second order.
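For readers unfamiliar with the test statistic, the following minimal sketch shows a per-sample Welch t-test with the ±4.5 threshold and the usual univariate second-order preprocessing (centering and squaring each sample). It is not our measurement tooling. The function names and the synthetic Gaussian traces are assumptions for illustration: the two groups are generated with equal means but different variances, mimicking a first-order masked design that leaks only in the second order.

```python
import math
import random


def welch_t(group_a, group_b):
    """Per-sample Welch t-statistic between two sets of equal-length traces."""
    n_a, n_b = len(group_a), len(group_b)
    stats = []
    for col_a, col_b in zip(zip(*group_a), zip(*group_b)):
        mean_a, mean_b = sum(col_a) / n_a, sum(col_b) / n_b
        var_a = sum((x - mean_a) ** 2 for x in col_a) / (n_a - 1)
        var_b = sum((x - mean_b) ** 2 for x in col_b) / (n_b - 1)
        stats.append((mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b))
    return stats


def center_square(traces):
    """Second-order preprocessing: square each sample after removing its mean."""
    n = len(traces)
    means = [sum(col) / n for col in zip(*traces)]
    return [[(x - m) ** 2 for x, m in zip(trace, means)] for trace in traces]


def leaks(group_a, group_b, threshold=4.5):
    """True if any sample point exceeds the t-test threshold."""
    return any(abs(t) > threshold for t in welch_t(group_a, group_b))


def synthetic_traces(sigma, n, seed, points=2):
    """Hypothetical stand-in for measured traces: zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(points)] for _ in range(n)]


fixed = synthetic_traces(1.0, 5000, seed=1)
rand = synthetic_traces(2.0, 5000, seed=2)
first_order_leak = leaks(fixed, rand)
second_order_leak = leaks(center_square(fixed), center_square(rand))
```

With these synthetic groups, the first-order statistic is expected to stay within the threshold, while the second-order statistic exceeds it clearly.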

Comparison
In Table 5, we compare our implementation against an unmasked implementation of Streamlined NTRU Prime and two first-order masked FPGA implementations of Saber and Kyber. To the best of our knowledge, we are the first to report a higher-order full FPGA implementation of any PQC scheme and the first to report a masked full ASIC PQC implementation. Thus, we cannot compare it to other higher-order implementations.
As expected, the two unprotected implementations are smaller, faster, or both. The masked Saber implementation has a LUT and FF footprint comparable to our first-order implementation and uses no BRAM but significantly more DSPs. However, it is about an order of magnitude faster. In contrast, the masked Kyber-512 implementation is bigger than even our fourth-order implementation, but only faster by a factor of 6.8 compared to our first-order implementation. Moreover, both the Saber and the Kyber-512 implementations only support first order, while our design can easily be instantiated at an arbitrary order, allowing protection against more advanced attacks. Finally, our masked gadgets have been formally verified to be secure, and we do not need any masking conversion, which could be targeted in future attacks.
Table 6 shows how performance progresses for our implementation and provides a comparison to gadget-based implementations of AES. While the randomness overhead is

Table 5. Comparison with other masked PQC implementations. All implementations are synthesized for Artix-7, except for Kyber, which is synthesized for Virtex-7. Note that for ASIC, we are the first to report a fully masked implementation of any PQC scheme.

Gadget-based Masking
There are several advantages of a gadget-based masked implementation. First, it is effortless to adapt to an arbitrary masking order. This reduces the time required for development. Moreover, no masking conversion can be attacked since there is none. The masking conversion was the target in the attacks against a first-order and third-order masked Saber implementation [NDJ21, NWDP22]. Additionally, exchanging the underlying gadgets for others with the same latency properties is usually straightforward. For example, it could be possible to achieve a fault-secure implementation easily by deploying the work from [FRBSG22]. In addition, while our design does not include an RNG, generating randomness is relatively straightforward and cheap in hardware: The recent work [CMM+23] analyzes the cost of securely generating random bits for use in masking, with costs of 20 to 30 GE or 3 to 4 LUTs per bit while using a round-reduced version of the Trivium cipher. This additional area is minimal compared to the area usage of our design, and would only add roughly 4 %.
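The order of magnitude of this estimate can be reproduced with a short calculation. The sketch below assumes that the figure refers to the first-order ASIC instance, with the peak demand of 310 fresh random bits per clock cycle and the area of 201 kGE reported in the Conclusion, and that each random bit per cycle costs 20 to 30 GE. This is one plausible reading and not necessarily the exact derivation; the function name is hypothetical.

```python
def prng_overhead(bits_per_cycle, ge_per_bit, design_ge):
    """Relative area overhead of generating the required random bits."""
    return bits_per_cycle * ge_per_bit / design_ge


# Assumed inputs: first-order ASIC instance, cost range from [CMM+23].
low = prng_overhead(310, 20, 201_000)   # about 3.1 %
high = prng_overhead(310, 30, 201_000)  # about 4.6 %
```

Under these assumptions, the overhead lies between roughly 3 % and 5 %, which is consistent with the stated value of roughly 4 %.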

Potential Improvements
We leave several potential improvements as future work and address them here. The polynomial multiplications have the most conspicuous latencies, where the two Rq multiplications take 62 % of the decapsulation cycle count, and the multiplication in R3 takes another 31 %. To speed this up, it is possible to instantiate more adders in parallel at the cost of slightly more area and a potentially higher amount of randomness per clock cycle, depending on the degree of parallelism. For example, halving the latency of both multipliers results in a 47 % speed-up at the cost of approximately 8 % more gate equivalents for the first-order ASIC implementation. Moreover, a potential area reduction can be achieved by optimizing the CSubQ module, which would also reduce the amount of required randomness.
Additionally, we want to stress that the specified encoding procedure for polynomials in Rq is suboptimal for hardware implementations, as it includes multiplications. This accounts for the four DSP slices required in the FPGA implementation and about 7.3 kGE in the ASIC implementation. However, alternatives would increase transmission sizes and would require a change of the Streamlined NTRU Prime specification.
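The quoted speed-up follows from the cycle shares. A minimal sketch of the arithmetic, assuming that "both multipliers" refers to the Rq multiplier (62 % of the cycles) and the R3 multiplier (31 %), and that the remaining 7 % are unaffected; the function name is hypothetical:

```python
def remaining_latency(fractions, factors):
    """Relative latency after accelerating each fraction by its factor."""
    accelerated = sum(f / s for f, s in zip(fractions, factors))
    untouched = 1 - sum(fractions)
    return untouched + accelerated


# Halving the latency of the Rq and R3 multiplications.
after = remaining_latency([0.62, 0.31], [2, 2])
saved = 1 - after  # 0.465, i.e. roughly 47 % fewer cycles
```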

Symmetric Core
As discussed in Section 5.1, masking the symmetric core (i.e., SHA-512) in Streamlined NTRU Prime consumes a considerably large part of the entire implementation's footprint and has the highest per-cycle randomness consumption. It also limits the maximum frequency due to the high routing cost of the 64-bit Sklansky adder. Nevertheless, hardened SHA-512 implementations are widely deployed in industry and can, for example, be found in smartcards and secure elements. Thus, one could assume that a secure SHA-512 is already available and does not need to be implemented. If we exclude the SHA-512 from the area consumption (cf. Table 2 and Table 3), then the design is not only surprisingly small at first order, but the area overhead also grows much more moderately with increasing masking order.
Another possibility would be to replace the 64-bit Sklansky adder deployed in the SHA-512 module by a smaller one, trading area for latency. Moreover, it is possible to deploy no additional adder for the SHA-512 module by reusing the secure adder from the polynomial multiplication module. In this case, five consecutive 13-bit additions would yield the 64-bit addition. This would require cleverly scheduling the additions required by SHA-512 such that the 13-bit adder pipeline is maximally occupied. As can be seen from Table 2 and Table 3, the 64-bit Sklansky adder occupies about half of the area of the SHA-512 module and about a quarter of the overall area.
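The decomposition into five 13-bit additions can be sketched functionally as follows. This is an unmasked model on plain integers that only illustrates the carry chaining; the names are hypothetical, and the scheduling and pipelining questions mentioned above are not addressed.

```python
CHUNK = 13
CHUNK_MASK = (1 << CHUNK) - 1


def add13(x, y, carry_in):
    """13-bit adder with carry-in, returning (sum, carry_out)."""
    total = x + y + carry_in
    return total & CHUNK_MASK, total >> CHUNK


def add64_via_add13(x, y):
    """64-bit modular addition built from five consecutive 13-bit additions."""
    result, carry = 0, 0
    for i in range(5):  # 5 * 13 = 65 bits cover the 64-bit operands
        xi = (x >> (CHUNK * i)) & CHUNK_MASK
        yi = (y >> (CHUNK * i)) & CHUNK_MASK
        chunk_sum, carry = add13(xi, yi, carry)
        result |= chunk_sum << (CHUNK * i)
    return result & ((1 << 64) - 1)  # SHA-512 adds modulo 2^64
```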
Additionally, in order to reduce the total area overhead introduced by the masked symmetric core in Streamlined NTRU Prime, the SHA-512 could be replaced by an implementation based on Keccak [BDPA13]. As Keccak does not use an adder internally, it is significantly easier and cheaper to mask. Most notably, it can be implemented with a very low amount of fresh randomness [BDN+13]. In addition, as the critical path lies in the SHA-512 module for both FPGAs and ASICs, using Keccak would likely increase the maximum achievable clock frequency. However, this would deviate from the Streamlined NTRU Prime specification and would not be interoperable with other Streamlined NTRU Prime implementations.

Applicability to Kyber
The efficiency of our gadget-based masking is built upon the fact that each of the three polynomial multiplications that are carried out includes a secret polynomial with ternary coefficients, where the other operand is either small and secret as well or has a big coefficient modulus and is public. This enables us to perform schoolbook multiplication in the Boolean domain. Notably, Kyber has a similar property: Here, all polynomial multiplications have one public input polynomial with "big" coefficients modulo q = 3 329.
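Why a ternary operand removes the need for a multiplier can be seen in a functional sketch of schoolbook multiplication in Z_q[x]/(x^p − x − 1), the ring used by Streamlined NTRU Prime. The code below is unmasked, not constant-time, and uses hypothetical names; it is meant only to show that every partial product reduces to a three-way selection followed by an addition modulo q.

```python
def ternary_select(t, value, q):
    """Three-way selection of 0, +value or -value (mod q); no multiplier."""
    if t == 0:
        return 0
    return value % q if t == 1 else (-value) % q


def mul_ternary(f, g, p, q):
    """Multiply ternary f by g in Z_q[x]/(x^p - x - 1) using additions only."""
    acc = [0] * (2 * p - 1)
    for i, t in enumerate(f):
        for j, coeff in enumerate(g):
            acc[i + j] = (acc[i + j] + ternary_select(t, coeff, q)) % q
    # Reduce with x^p = x + 1, i.e. x^k = x^(k-p+1) + x^(k-p) for k >= p.
    for k in range(2 * p - 2, p - 1, -1):
        acc[k - p] = (acc[k - p] + acc[k]) % q
        acc[k - p + 1] = (acc[k - p + 1] + acc[k]) % q
    return acc[:p]
```

For a toy parameter p = 5 with q = 4591, multiplying x by x^4 yields x^5 = x + 1, i.e. `mul_ternary([0, 1, 0, 0, 0], [0, 0, 0, 0, 1], 5, 4591)` returns `[1, 1, 0, 0, 0]`.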
Moreover, the polynomial degree is far smaller, with 256 compared to 761 for Streamlined NTRU Prime, enabling a faster multiplication. For Kyber, $256^2 = 65\,536$ coefficient additions are to be performed per polynomial multiplication, whereas Streamlined NTRU Prime with $p = 761$ requires $p^2 + p = 579\,882$ coefficient additions. However, Kyber requires more multiplications to be performed: for $k \in \{2, 3, 4\}$, it requires $k^2 + 2k$ polynomial multiplications, as well as $k^2 + 4k - 1$ polynomial additions, whereas Streamlined NTRU Prime constantly requires three polynomial multiplications, one of which only uses "small" coefficients and is thus much cheaper.
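The counts quoted above can be reproduced directly from the stated formulas. The short sketch below uses hypothetical function names and does not attempt to reproduce the full accounting of Table 7.

```python
def additions_per_multiplication(n, extra=0):
    """Coefficient additions of one schoolbook multiplication: n^2 + extra."""
    return n * n + extra


def kyber_operations(k):
    """Polynomial multiplications and additions stated for Kyber with rank k."""
    return k * k + 2 * k, k * k + 4 * k - 1


KYBER_N = 256
SNTRUP_P = 761

kyber_adds = additions_per_multiplication(KYBER_N)                     # 65 536
sntrup_adds = additions_per_multiplication(SNTRUP_P, extra=SNTRUP_P)   # 579 882
kyber_ops = {k: kyber_operations(k) for k in (2, 3, 4)}
```

For k = 2, 3, and 4, the formulas give 8, 15, and 24 polynomial multiplications and 11, 20, and 31 polynomial additions, respectively.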
We compare the cost in terms of the estimated number of coefficient additions in Table 7. As seen there, Kyber consistently requires fewer "big" coefficient additions than Streamlined NTRU Prime in the respective security categories. Another advantage for Kyber is that its key generation features no operations that are infeasible to mask in the Boolean domain, which is not the case for Streamlined NTRU Prime. The most complex remaining operations in Kyber, both for key generation and decapsulation, are (de-)compression and sampling from a centered binomial distribution using a Keccak output stream, both of which are feasible in the Boolean domain.
One downside for Kyber is that the secret coefficients have the range of [−2, 2] or [−3, 3]. This would require a more complex five-way or seven-way secure multiplexer. In addition, a gadget-based masked Kyber implementation would require a Number-Theoretic Transform (NTT) core: Kyber requires expanding a seed into a public matrix of polynomials, which are assumed to be in the NTT domain. Since the implementation would not perform multiplication in the NTT domain, an inverse transform of each polynomial in the matrix would be required, resulting in $k^2$ inverse NTTs during decapsulation. Finally, it is noteworthy that Kyber's use of the same polynomial ring for all security levels is no advantage for a gadget-based masked implementation, since the schoolbook multiplication used for Streamlined NTRU Prime also allows for easy parametrization. On the other hand, Streamlined NTRU Prime changes the coefficient modulus over the parameter sets, which might require manual adjustments. Overall, we leave this as an interesting open idea for future work.

Applicability to Dilithium
From a side-channel point of view, multiplying the public challenge polynomial c with the secret key vector $s_1$, followed by an addition to the nonce y, is the most critical operation for signature generation in Dilithium [SLKG23]. We highlight that the challenge polynomial c is also public for rejected signature candidates [KLRBG23] and is sparse and ternary. The coefficients of the secret key polynomial vector $s_1$ are uniformly random with a small bound (i.e., there are only five or nine possible values). The nonce y, finally, is a secret with a large bound. Thus, the whole operation can be masked similarly to the technique explained in this paper but with sparse multiplication. Notably, no modular reduction for the coefficient accumulation is required as the maximum bound for each coefficient after the operation is lower than the modulus, and the subsequent bound check can be performed in signed representation. The multiplication-accumulation $w - cs_2$ of the secret w and $s_2$ and public c can also be masked using our method, though it does require modular reduction.
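The structure of this operation can be sketched functionally for a single polynomial. The code below is an unmasked model with hypothetical names, assuming Dilithium's negacyclic ring Z_q[x]/(x^n + 1); the challenge is given by the positions and signs of its nonzero coefficients, and the accumulation is carried out on signed integers without modular reduction.

```python
def sparse_mul_acc(c_nonzero, s, y):
    """Return y + c*s in Z[x]/(x^n + 1), with c given as {index: +1 or -1}."""
    n = len(s)
    acc = list(y)
    for i, sign in c_nonzero.items():
        for j, coeff in enumerate(s):
            k = i + j
            term = coeff if sign == 1 else -coeff  # selection, no multiplier
            if k >= n:  # wrap around with x^n = -1
                k -= n
                term = -term
            acc[k] += term
    return acc
```

For a toy ring with n = 4, the challenge x^3, the secret x, and y = 5 give x^4 + 5 = 4, i.e. `sparse_mul_acc({3: 1}, [0, 1, 0, 0], [5, 0, 0, 0])` returns `[4, 0, 0, 0]`.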
On the other hand, the other critical multiplication during signature generation, Ay, fulfills only the criterion of having one public (A) and one secret factor (y). The coefficients of y have $2^{18}$ or $2^{20}$ possible values (depending on the parameter set), rendering the gadget-based masking approach infeasible for this operation, as a full multiplier is required. In addition, similar to Kyber, A is expanded from seeds and expected to be in the NTT domain.

Conclusion
In our work, we have presented the first gadget-based masked implementation of any PKC scheme. Notably, it is competitive regarding area demand to other protected PQC implementations while offering reasonable latency. The main advantage is the ability to adapt the implementation easily to arbitrary masking orders. For the first-order secure instance of the implementation, 19 923 LUTs, 19 725 FFs, and 8.5 BRAMs are utilized, reaching a frequency of 200 MHz. Implemented as an ASIC, the first-order secure instance consumes 201 kGE and 189 kbit SRAM, reaching a frequency of 207 MHz. This results in a latency of only 9.35 ms on an FPGA and 9.03 ms on an ASIC, with a peak demand of fresh randomness of 310 bit per clock cycle. While for higher masking degrees, the latency only increases slightly due to a lower frequency, the randomness demand increases to 3 100 bit per clock cycle for d = 4. Nevertheless, further optimization of the hashing module could significantly reduce the area and randomness consumption. Moreover, our first-order implementation is formally and practically verified to be secure. Finally, we also analyzed the applicability of our concept to the designated NIST standard algorithm Kyber, finding that gadget-based masking could be applied efficiently as well.

Figure 1. Architecture of the Rq polynomial multiplier. Blue modules operate on masked shares.
[...] and the two "right" inputs $g_r^{(0:d)}$, $p_r^{(0:d)}$.
(a) 13 bit Adder with Carry In. (b) CSubQ with optimizations for q = 4591.

Figure 2. Sklansky Adder Constructions.
Panel titles: First-order t-test results. Second-order t-test results.

Figure 4. Measurement results for the SHA-Ma module (left) and the SHA-Ch module (right) using 10 million traces. Both modules are instantiated for d = 1.

Figure 5. Measurement results for the Mod3 module (left) and the Mul3 module (right) using 10 million traces. Both modules are instantiated for d = 1.

Figure 6. Measurement results for the Mux2 module (left) and the Mux3 module (right) using 10 million traces. Both modules are instantiated for d = 1.
Barthe et al. introduced Non-Interference (NI) as the first composability notion in 2015 [BBD+15]. Although NI limits the leakage between shared intermediate results, it does not guarantee probing security of composed circuits. Therefore, Barthe et al. presented the notion of Strong Non-Interference (SNI) [BBD+16], which ensures composability of gadgets. Eventually, Cassiers and Standaert proposed Probe-Isolating Non-Interference (PINI) [CS20], reducing the overhead introduced by SNI gadgets. PINI ensures that all shared AND gadgets are composable and that XOR as well as NOT operations can be performed share-wise without refreshing. Bringing this concept to concrete instantiations of SecAND gadgets in hardware, Cassiers et al. proposed Hardware Private Circuit (HPC)

Table 2. FPGA area results after PnR. Note that this does not include the area needed for randomness generation. Not listed is the Digital Signal Processor (DSP) usage: 4 DSPs are needed as multipliers in the decoder, regardless of the masking order.

Table 3. ASIC area results in Gate Equivalent (GE), using the 45 nm Nangate open cell library. The area does not include SRAM cells, which are listed separately. Note that this does not include the area needed for randomness generation. The area for the Encode R3 entity is not available for masking orders one through three, as it was merged with its parent entity.

Table 4. Verification results of the protected submodules using VERICA. We report for each design the number of combinational gates, the number of memory gates, and the verification time. Successful verification of the expected security order is indicated by green check marks. All verification runs marked with ∞ did not finish in a reasonable time.