Side-Channel Protections for Picnic Signatures



Introduction
Introduction
As the possible advent of a quantum computer threatens the security of widely deployed cryptographic schemes, the design of new quantum-resilient alternatives is a pressing task. Motivated by this issue, the US National Institute of Standards and Technology (NIST) is currently holding the Post-Quantum Cryptography (PQC) Standardization Process, in which Round 3 "finalists" and "alternate candidates" have recently been announced. Among them is Picnic, a signature scheme [ZCD + 20], which follows Ishai et al.'s MPC-in-the-head (or MPCitH, short for multi-party computation in-the-head) paradigm for constructing zero-knowledge (ZK) proof systems [IKOS07]. One of the attractive features of MPCitH-style signatures is that they require no number-theoretic hardness assumptions, since the typical construction of such schemes relies only on symmetric-key primitives. Concretely, following the standard Fiat-Shamir paradigm [FS87], signatures in the MPCitH paradigm can be proven secure in the random oracle model, as long as the underlying hash function and block cipher are secure. Quoting [Nat20], "NIST also sees Picnic's reliance on only assumptions about symmetric primitives as an advantage in case the need arises for an extremely conservative signature standard in the future".

scheme BBQ [dDOS19]. The preprocessing phase is independent of the witness, and is used by the parties and we observed overhead ranging from 1.8x to 2.8x (depending on the type of protections applied to hash function invocations). Since our hash function optimizations void the provable side-channel security of the signature scheme implementation as a whole, we experimentally verified the absence of leakage, to support our argument that security is maintained in practice.
Masked implementation of SHA-3. We also implemented a masked version of SHA-3, optimized with M4 assembly, as none was freely available. As SHA-3 is common to many PQ schemes and the M4 is a common embedded target, we expect this will be of independent interest. The implementation supports a range of options, from slower but provably SNI-secure, to our much faster optimized options. We experimentally verified the absence of leakage in the optimized version.

Related work
Since the seminal work by Ishai et al. [IKOS07], the MPCitH paradigm has been actively studied over the past decade. In particular, two closely related protocols, ZKBoo [GMO16] and ZKB++ [CDG + 17], brought MPCitH closer to practice, leading to the submission of Picnic1 to Round 1 of the NIST PQC Standardization Process. Katz, Kolesnikov, and Wang [KKW18] extended the paradigm to MPCitH with preprocessing (MPCitH-PP), and the corresponding version, Picnic2, was added during Round 2. Kales and Zaverucha [KZ20] further optimized Picnic2 from various implementation aspects and accordingly proposed Picnic3. Although our masked implementation focuses on Picnic3, which is instantiated with KKW and the LowMC circuit [ARS + 15], our generic approach in Section 4 also applies to BBQ (KKW instantiated with the AES circuit) [dDOS19] and Baum and Nof's variant of KKW (instantiated over an arithmetic circuit for proving SIS instances) [BN20]. A similar offline-online paradigm also appears as a notion called "sigma protocol with helper" [Beu20], used to construct proofs of knowledge for systems of quadratic equations, improving a protocol of Stern [Ste94].
The stateful hash-based signature schemes XMSS and LMS are known to be relatively resistant to side-channel attacks, as they basically use pseudorandom keys for each signature. Hence, their resistance against side-channel attacks rests on the resistance of the underlying pseudorandom number generator. Masked implementations of lattice-based signing have been given for Dilithium [MGTF19] (NIST PQC finalist) and qTESLA [GR19] (Round 2 candidate). While these masked signing operations do output a signature compliant with the existing verification algorithm, they rely on an additional non-standard hardness assumption for provable side-channel security (see [BBE + 18, DOTT21] for details). The issue could be circumvented by modifying the "commit" message of the underlying Σ-protocol, but this in turn breaks the interoperability of the output signature. By contrast, signatures directly derived from our generic approach to masking KKW (Section 4), as well as our NIo-secure Picnic3 implementation, maintain interoperability, and may optionally make additional assumptions for improved performance. Performance-wise, the benchmarks on the Cortex-M4 given by Gérard and Rossi [GR19, Table 6] show much less overhead than ours: their first-order protected qTESLA-I incurs only 2.1x overhead in signing clock cycles and requires 343 KB of fresh randomness, while our provably secure masked Picnic3 is 5.5x slower than the unprotected version and consumes 158 MB of randomness. However, by trading the provable security guarantee as we describe in Section 5, our empirically validated countermeasures achieve a lower overall overhead of 1.8x, requiring 2 MB of randomness. Giving a meaningful performance comparison with masked Dilithium [MGTF19] is hard, as the authors only provide benchmarks on Intel for the whole signing operation. Although their overhead for first-order protection is about 5.6x, we expect that it can be made faster on the M4 by using the platform's TRNG.

Preliminaries
Notation. We denote the set {1, . . . , T } by [T ]. The number of multiplication gates in a circuit C is denoted by |C|.
Security levels. The parameter sets for the algorithms submitted to the NIST competition must meet one of five security levels. Picnic defines parameters for security levels L1, L3 and L5, corresponding to the security of AES 128, 192 and 256, respectively. For instance, parameters at level L1 aim to provide 128-bit security against classical attacks.
LowMC. The LowMC block cipher is described in detail in Appendix E, and here we briefly review the notation. The block and key size are both n bits, the number of rounds is denoted r, and the number of S-boxes is denoted s. In the Picnic3 parameters, n = 3s, since the three-bit S-box is applied to the full state. There are also constants: K_i are matrices used to compute the round keys, L_i are matrices used for the linear layer, and R_i are vectors used as round constants.
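For intuition, the three-bit S-box maps (a, b, c) to (a ⊕ bc, a ⊕ b ⊕ ac, a ⊕ b ⊕ c ⊕ ab); a minimal bit-level sketch (illustrative only, not the reference implementation):

```python
def lowmc_sbox(a: int, b: int, c: int) -> tuple:
    """LowMC three-bit S-box on single bits (^ is XOR, & is AND).
    Each application costs three AND gates, which is what the MPC
    protocol must pay for in broadcast communication."""
    return (a ^ (b & c),
            a ^ b ^ (a & c),
            a ^ b ^ c ^ (a & b))
```

Since the S-box is a permutation of the three-bit strings, all eight inputs map to distinct outputs.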
Formal Verification. In order to check our formal security analysis, we use the tool maskVerif developed by Barthe et al. [BBC + 19]. The tool provides an automatic and formal security verification of higher-order masking implementations based on the NI and SNI notions. Briefly speaking, it checks every possible attack combination within the implementation and either provides a security proof for the specified order or gives a list of potential attack targets in the implementation. The bottleneck of the tool is the order of the masking and the complexity of the implementation. It has been used in the literature to provide assurance of masked implementations [BBD + 15a, GSDM + 19, SEL21]. In this work, we use maskVerif to check SNI security of the basic components (namely, multi-party computation of multiplication in both online and offline phases) of the KKW proof system up to order 4. Although we further provide the verification scripts for the orders 5, 6 and 7, we did not run these, due to the higher combinatorial complexity of maskVerif. However, our manual security analysis in Section 4 indeed guarantees SNI security of higher-order masking. Furthermore, NIo-security of our fully masked Picnic3 (see Appendix B.1 for the specification) was verified with maskVerif up to order 2.
Leakage analysis. In this work we follow the test vector leakage assessment (TVLA) method by Goodwill et al. [GJJR11], which is based on Welch's t-test. TVLA implements a pass-fail test to decide if an implementation has exploitable leakage. It detects leakage at a given order and has two different versions: the non-specific and the specific method. The first version is defined as fixed-vs-random (FvR) and it aims to detect all possible first-order leakages. During the trace collection phase, a set of side-channel traces is collected by processing either a fixed input or a random input under the same conditions. If the t-test only gives a very small value, this indicates that the run of the algorithm on a fixed input is indistinguishable from a run of the algorithm on a random input. Hence, a small value in the fixed-vs-random scenario implies the absence of sensitive leakage. The second test is defined as random-vs-random (RvR), and employs only traces with a random input using a function of inputs to sort the traces. The main advantage of the RvR method is that it can identify specific exploitable leakages and thus shows the feasibility of an actual attack.
After collecting and sorting the traces, the means (µ_0, µ_1) and standard deviations (σ_0, σ_1) of the two sets are computed, and the t-statistic is t = (µ_0 − µ_1) / sqrt(σ_0^2/n_f + σ_1^2/n_r), where n_f and n_r denote the number of traces in the two sets, respectively. The t-test indicates whether the two distributions have the same mean, i.e., whether they are indistinguishable for a first-order side-channel analysis. We apply the customary threshold values for long traces suggested by [DZD + 18]: the value 5.7 for traces of length more than 10^4, and the value 6.1 for traces of length more than 10^6. The threshold rejects the null hypothesis of non-leakage with > 99.99% probability.
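The t-statistic computation at a single trace sample point can be sketched as follows (a minimal sketch of Welch's t-test as used in TVLA; the function name is ours):

```python
import math

def welch_t(fixed, random_):
    """Welch's t-statistic between the fixed-input and random-input
    trace samples at one point in time (TVLA non-specific test)."""
    nf, nr = len(fixed), len(random_)
    mu0 = sum(fixed) / nf
    mu1 = sum(random_) / nr
    # unbiased sample variances
    var0 = sum((x - mu0) ** 2 for x in fixed) / (nf - 1)
    var1 = sum((x - mu1) ** 2 for x in random_) / (nr - 1)
    return (mu0 - mu1) / math.sqrt(var0 / nf + var1 / nr)
```

A |t|-value exceeding the threshold (5.7 or 6.1, per the trace length) flags a sample point as leaking.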

MPC-in-the-head with preprocessing
MPC-in-the-head. We first describe the basic approach to construct a zero-knowledge proof of knowledge (ZKPoK) system for an arbitrary nondeterministic polynomial-time (NP) language L, following Ishai et al. [IKOS07] and its generalization due to Giacomelli et al. [GMO16]. Given L, we can define an NP relation R(x, w) which returns 1 if its input consists of a valid pair of statement x ∈ L and corresponding witness w, and outputs 0 otherwise. An MPCitH proof system (P, V) is built upon some N-party MPC protocol that jointly computes a function f, where f takes x and w as public and private input, respectively, and outputs f_x(w) = R(x, w). For example, for a given encryption algorithm Enc of a block cipher like LowMC [ARS + 15], one can define f_x(w) := (Enc(sk, p) =? c), where the statement x = (p, c) is a plaintext-ciphertext pair, and the witness w = sk is a private encryption key. In this case, the prover P proves knowledge of a private key that produces a certain public ciphertext from the corresponding public plaintext.
At a high level, an MPCitH prover P attempts to convince the verifier V that they hold a valid witness w, by letting V check that the MPC protocol has been correctly carried out "in P's head" on input w. We now consider an MPC protocol Π_C for the corresponding arithmetic circuit C defined over a finite field F, where the statement information x (e.g., the plaintext-ciphertext pair) is hard-coded such that C(·) = f_x(·). We assume that the witness is expressed as an n-dimensional vector and that C takes a set of n input wires denoted by IN. We write w = (w)_{w∈IN} ∈ F^n for the complete input. To initialize the protocol, the prover P first additively secret shares each input wire w such that w = w_1 + . . . + w_N in F, and considers each share w_i as a private input to a party P_i. Then P internally runs Π_C to obtain view_1, . . . , view_N, where each view_i consists of P_i's private input w_i, the random tape of P_i and all incoming messages that P_i observes during the execution of Π_C. The proof system now proceeds by following the typical "commit-challenge-response" flow. Using a secure commitment scheme, P sends Commit(view_i) for all i ∈ [N] as the first message. Upon receiving distinct challenges i_1, . . . , i_t ∈ [N] from the verifier V, the prover P sends back the corresponding t views view_{i_1}, . . . , view_{i_t} as well as the commitment opening information as a response. Finally, the verifier V accepts the proof iff the opened views are consistent with each other and they produce 1 as output of the protocol Π_C. The (honest verifier) zero knowledge is guaranteed as long as the underlying MPC Π_C has t-privacy in the semi-honest model (i.e., the distribution of any ≤ t views during an honest execution of the protocol is polynomial-time simulatable, given the output from Π_C and the corresponding ≤ t parties' private input).
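The initialization step, additively sharing each input wire among N parties, can be sketched as follows (a toy example over a prime field; Picnic itself works over F_2, and the modulus below is an arbitrary illustrative choice):

```python
import random

P = 2**61 - 1  # illustrative prime modulus for the field F

def share(w: int, n: int) -> list:
    """Additively secret-share w into n shares: w = w_1 + ... + w_n in F."""
    parts = [random.randrange(P) for _ in range(n - 1)]
    parts.append((w - sum(parts)) % P)  # last share corrects the sum
    return parts

def reconstruct(parts: list) -> int:
    return sum(parts) % P
```

Any N − 1 of the shares are uniformly random and independent of w, which is why opening all but one view reveals nothing about the witness.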
MPC in the preprocessing model. In work following [GMO16,CDG + 17], Katz, Kolesnikov, and Wang [KKW18] showed that a particular communication-efficient MPC protocol in the preprocessing model is well suited to MPCitH proofs, and variants of their protocol appear in subsequent work [dDOS19, BN20,KZ20]. The core idea of MPC in the preprocessing model is to split the protocol Π C into an offline phase Π off C and an online phase Π on C . Importantly, the offline phase Π off C can be computed independently of the witness. By precomputing correlated randomness in advance during Π off C , one can reduce communication in Π on C drastically. In the traditional MPC setting, this was already used, e.g., in SPDZ [DPSZ12], MiniMAC [DZ13], and TinyOT [NNOB12]. While the original KKW proof system is focused on the protocol for boolean circuits, it also works with arithmetic circuits in a straightforward manner as observed in [dDOS19, BN20], so we present the latter case here for the sake of generality.
(Offline Phase) The offline phase Π^off_C of KKW works as follows: for each input wire w ∈ IN to the circuit C, and for each output wire z from all the multiplication gates, each party P_i locally generates random shares λ^w_i, λ^z_i ∈ F using its own random tape. Then the parties compute random shares for all internal wires, by running the circuit:
-for each addition gate that takes wires x and y as input, party P_i locally computes a new share λ^z_i = λ^x_i + λ^y_i for the output wire z;
-for each multiplication gate that takes wires x and y as input, party P_i obtains shares of a multiplication triple (sometimes called a Beaver triple [Bea92]) (λ^x_i, λ^y_i, λ^{xy}_i), such that λ^{xy} = λ^x λ^y.
(Multiplication Triples) To generate multiplication triples in the MPCitH setting, the parties choose λ^x and λ^y implicitly by reading their shares from their random tapes. Then, to obtain shares of λ^{xy}, the first N − 1 parties read random shares from their random tapes. As the prover P knows all the shares, P can simply solve for the N-th party's share so that the shares reconstruct λ^{xy}, as required. We call the sequence of values λ^{xy}_N for all multiplication gates the auxiliary information, denoted aux ∈ F^{|C|}. Note that the complete information needed for the first N − 1 parties can be derived from their respective seeds seed_i used to generate P_i's tape. The information needed for party P_N can be derived from seed_N and from aux. Hence, we define each party P_i's state information as follows: for all i = 1, . . . , N − 1, let state_i := seed_i, and for P_N we have state_N := seed_N ||aux.
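The triple-generation step above can be sketched as follows (toy code over the same illustrative prime field; the helper name is ours):

```python
import random

P = 2**61 - 1  # illustrative prime modulus

def triple_aux(lam_x: int, lam_y: int, n: int, rng=random):
    """Derive shares of lambda_xy = lambda_x * lambda_y: parties 1..N-1
    read random shares "from their tapes"; the prover solves for the
    N-th party's share, which becomes part of aux."""
    target = (lam_x * lam_y) % P
    first = [rng.randrange(P) for _ in range(n - 1)]  # tape-derived shares
    aux_share = (target - sum(first)) % P             # party N's share
    return first + [aux_share], aux_share
```

Since only the last share depends on the product, the verifier can recompute everything for the first N − 1 parties from their seeds alone.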
(Online Phase) Given the preprocessed state information, the online phase Π^on_C proceeds by computing the masked witness ŵ = w + Σ_{i∈[N]} λ^w_i for each input wire. Now, each gate takes masked inputs x̂, ŷ and can be computed as follows, where all computations on shares are carried out in F:
-Addition: each P_i locally computes x̂ + ŷ.
-Addition by constant c: each P_i locally computes x̂ + c.
-Multiplication by constant c: each P_i locally computes c · x̂.
-Multiplication: this computation consumes a single triple ((λ^x_i)_{i∈[N]}, (λ^y_i)_{i∈[N]}, (λ^{xy}_i)_{i∈[N]}). Each party P_i first locally computes s_i = λ^z_i − x̂ · λ^y_i − ŷ · λ^x_i + λ^{xy}_i and broadcasts s_i. Then the masked output ẑ = xy + Σ_{i∈[N]} λ^z_i can be obtained as ẑ = Σ_{i∈[N]} s_i + x̂ŷ by each party. Notice that Π^on_C broadcasts only once for each multiplication gate, thanks to the correlated randomness computed during the offline phase. All other operations are computed locally by the parties.
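Over F_2 (the Picnic3 setting, where addition and subtraction are both XOR), one multiplication gate of the online phase can be sketched bit-wise as follows (toy code, names are ours):

```python
def mul_gate(x_hat, y_hat, lx, ly, lxy, lz):
    """One online MUL gate over F_2. lx, ly, lxy, lz hold each party's
    share of the corresponding masks; each party broadcasts s_i, then
    everyone recovers the masked output z_hat = (x & y) ^ lambda_z."""
    n = len(lx)
    s = [lz[i] ^ (x_hat & ly[i]) ^ (y_hat & lx[i]) ^ lxy[i] for i in range(n)]
    z_hat = x_hat & y_hat
    for s_i in s:        # the single broadcast round per MUL gate
        z_hat ^= s_i
    return s, z_hat
```

Expanding the XORs shows z_hat = (x̂ ⊕ λ^x)(ŷ ⊕ λ^y) ⊕ λ^z = xy ⊕ λ^z, as required.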
Protocol. Below we present a basic framework for three-round MPCitH-PP proof systems. Here we describe the protocol for one MPC instance (with non-negligible soundness error) and in Fig. 6 we include a complete description of the KKW proof system that uses many instances in parallel (to achieve negligible soundness error). As the offline protocol proceeds independently of the secret witness, an MPCitH-PP prover can safely open the states of all N parties for the verification of the preprocessing phase (i.e., triple generation).
(Commit) The prover P first samples a random seed for each P i and executes Π off C to obtain the states of all N parties after the offline phase. Then using these states and the masked witness (ŵ) w∈IN as input, P executes Π on C to obtain all broadcast messages observed during the online phase. Finally, P sends commitments to the states and broadcast messages to the verifier V.
(Challenge) V asks P to open either the offline or the online phase. For the latter case, V also randomly picks a party index i , whose view is to remain hidden.

(Response)
To open the offline phase, P sends all random seeds used during Π off C . To open the online phase, P sends broadcast messages coming from the party P i during Π on C , as well as all the state information of the remaining N − 1 parties.
(Verification) To check the offline phase, V simply uses random seeds to execute Π off C as P would do, to obtain the resulting states of all N parties. Then V checks that these states form a correct opening to the commitment of the offline phase. To check the online phase, V simulates Π on C with the broadcast messages from P i and the states of the remaining N − 1 parties as input, so as to obtain the broadcast messages of the other N − 1 parties. Then, V checks that these broadcast messages form a correct opening to the commitments of the online phase.

Picnic
The signature scheme Picnic is an instance of the MPCitH paradigm described above. The function f is the LowMC block cipher, the signer's secret key is the witness w, and the public key is (x, c). A signature consists of a proof of knowledge of w such that f_x(w) = c. In the block cipher notation, if the secret key is denoted sk, then the public key is a plaintext-ciphertext pair (x, LowMC_sk(x)) where x is a randomly chosen plaintext block, and the signature proves knowledge of a key relating the plaintext x and the ciphertext LowMC_sk(x). The proof is made non-interactive by the Fiat-Shamir transform, and the message to be signed is bound to the proof by hashing it into the challenge.
The Picnic specification [Pic20] and NIST submission include parameter sets using both the ZKB++ proof system and the KKW system, as well as specific choices of parameters for LowMC. Since the KKW-based parameters (referred to as Picnic3) are the most efficient in terms of signature size, we choose to focus on those in this paper. In particular, our masked implementation is limited to the parameter set Picnic3-L1. Fig. 6 describes the KKW proof system at a high level, and Algorithm 14 describes Picnic3 signing in full detail.

Side-Channel Attacks and Threat Model
Physical attacks are a threat to cryptographic implementations. Attacks such as side-channel analysis (SCA) can be used to extract secret keys by observing physical properties of an implementation, such as timing, power consumption or electromagnetic emanation [Koc96,KJJ99,QS11]. A popular countermeasure to SCA is known as masking [CJRR99a,ISW03]. Masking can protect against a broad class of SCA and probing attacks by splitting secrets into independent shares. A popular model for analyzing masking schemes is the t-probing model, where the adversary can probe one (or more) shares of the masked variable [ISW03]. Briefly, a probing adversary may invoke a cryptographic implementation multiple times with chosen inputs. Before each call, the adversary can choose a set of up to t wires of the circuit and observe the values on these wires during the invocation. After c calls, an attacker can then combine the c × t observations in arbitrary ways to extract information about sensitive variables. This model is closely related to concrete physical attacks: for example, in [KGM + 20], eight simultaneous probes are given as an upper limit achievable by modern commercially-available probe stations, and we therefore assume that t is at most sixteen. While the t-probing model is a clean theoretical model, Duc et al. [DDF19] showed that security in this model implies security in the more practical, SCA-inspired noisy leakage model [PR13]. To protect against multi-probe attacks, generally higher-order masking is applied, where the number of independent shares is increased. Higher-order SCA is expensive, as the number of measurements needed grows exponentially with the masking order, effectively limiting the attack order or the number of simultaneous probes an adversary may use (which we assume is below sixteen). Krachenfels et al. [KGM + 20] describe a new attack technique, laser logic state imaging, that can potentially use an unlimited number of probes; however, this attack is quite new and may not be widely applicable, and in any case must be mitigated with countermeasures below the software level (at the package, device or circuit level). Besides probing and SCA, there are further physical attacks, like fault analysis [BDL97], that we do not address in this work.
In this work, we use the noisy leakage model introduced by Chari et al. [CJRR99b] and extended by Prouff and Rivain [PR13]. The model enables an adversary to obtain each intermediate value perturbed with a noisy leakage function. Furthermore, as stated above, we use the connection between the probing and noisy leakage models given by Duc et al. [DDF19]. Therefore, in our threat model, a probing adversary reflects the capabilities of a real-world adversary, such as one mounting DPA. We assume an adversary who can access a physical device running the Picnic3 signature scheme. They can measure side-channel traces, such as power or electromagnetic emanation, of the device while it signs chosen messages. Moreover, they obtain the signatures as output, and can verify them (and thus see the revealed values) or use them arbitrarily in an attack. Observe that, according to the noisy leakage model, the side-channel trace contains each intermediate value perturbed with a noisy leakage function. The revealed values vary per signature, and the adversary can employ these values to recover the secret. Note that the targeted secret differs depending on the attack scenario (details are given in Section 3); the countermeasures we introduce in Section 4 thwart both scenarios.

Security Notions for Masking Countermeasures
For more comprehensive background we refer readers to Appendix F. In the following, we fix some finite field (F, 0, 1, +, −, ·). As explained above, we are working in the t-probing model, which allows an attacker to obtain the value of t variables per run of the primitive. The most common technique to mitigate side-channel attacks is to encode sensitive variables via an additive (or polynomial-based) secret sharing into T > t parts. We say that a vector (v_j)_{j∈[T]} ∈ F^T is an encoding of v ∈ F if v = v_1 + · · · + v_T. For readability, we often write v instead of (v_j)_{j∈[T]}. For a subset I ⊆ [T], let x_I = (x_i)_{i∈I}, and furthermore let Ī = [T] \ I. Variables are shared both to protect against side-channel attacks and as part of the MPC protocol. To distinguish between these situations, we call a sharing between parties in the MPC protocol a sharing, and call it an encoding when the goal is to protect against side-channel attacks. In this work, we aim to prove that our basic building blocks meet the following standard security notions [BBD + 16].
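A T-encoding over F_2 and a simple mask refresh can be sketched as follows (toy code; note that this naive refresh is fine for first-order intuition, but SNI-secure refreshing at higher orders requires a stronger gadget):

```python
import secrets

def encode(v: int, t: int, width: int = 32) -> list:
    """Encode a width-bit value v into T boolean shares: v = v_1 ^ ... ^ v_T."""
    shares = [secrets.randbits(width) for _ in range(t - 1)]
    acc = v
    for s in shares:
        acc ^= s
    shares.append(acc)  # last share fixes the XOR sum
    return shares

def decode(shares: list) -> int:
    acc = 0
    for s in shares:
        acc ^= s
    return acc

def refresh(shares: list, width: int = 32) -> list:
    """Re-randomize an encoding by XORing in a fresh encoding of zero."""
    zero = encode(0, len(shares), width)
    return [s ^ z for s, z in zip(shares, zero)]
```

Any proper subset of the shares is uniformly random, so t < T probes reveal nothing about v.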

Definition 1 (t-NI, t-SNI). Let G be a gadget with inputs encoded over F. G is t-NI if every set of at most t probes on its wires can be perfectly simulated from at most t shares of each of its input encodings. G is t-SNI if every set of t_1 probes on internal wires and t_2 probes on output shares, with t_1 + t_2 ≤ t, can be perfectly simulated from at most t_1 shares of each of its input encodings.
The above definition of SNI as well as its composability result can be generalized for a gadget with multiple input/output encodings [CS20]. We also say G is t-SNI with uniform output-distribution, if the outputs of G which are not affected by any probes are uniformly distributed (see Appendix F for a definition).

Probing Attacks on Picnic3
We describe two probing side-channel attacks against Picnic3 and experimentally confirm them against our M4 port of the optimized Picnic3 implementation. Probing attacks usually exploit weak leakage of intermediate variables, gathered from several measurements. As described in [SBWE20] and [GSE20], the values revealed by the prover to allow the verifier to check the consistency of the MPC protocol can be employed by an adversary in a side-channel attack. We assume the same scenario. Furthermore, we assume a leakage model where an implementation leaks weak and noisy information about each intermediate variable; therefore, measurements of the MPC-in-the-head simulation have a weak and noisy dependence on secret values via the revealed values. As mentioned above, we make use of the RvR tests to show the clear presence of leakage.

Probing the Masked Secret of Unopened Online Phase
This attack is novel, as it is specific to MPCitH with preprocessing, and only occurs when the 5 protocol rounds are compressed to 3. Hence the attack below works in principle for any direct implementation of signatures derived from three-round KKW-based protocols. We also remark that this attack cannot be mitigated by the SNIitH approach [SBWE20]; in particular, the attack below works independently of the number of unopened parties' views, since it targets an input to the MPC (i.e., ŵ), not a share of the secret. We are thus motivated to design an alternative solution to thwart this attack in the next section.
We first note that the three-round KKW scheme executes both the offline and online phase of each MPC instance, in contrast to the five-round case. We denote by C the executions chosen for the online phase, i.e., the executions where the offline phase is not made public.
The attack exploits the following: if the k-th execution of the offline phase is selected to be part of the signature (i.e., if k ∉ C), the preprocessed masks and state of all N parties are made public for the verifier, and therefore the corresponding online phase must remain hidden. Concretely, since the secret witness wire value w is masked by random bits in step 1c of the prover in Fig. 6, the attacker's goal is to learn the masked witness wire values ŵ^(k) for the unopened online executions k ∉ C. Since λ^{w,(k)}_i is made public (for all i), by probing ŵ^(k) the attacker can solve for the secret key bit w. Here, λ^{w,(k)}_i denotes the value of λ^w_i in execution k. In order to validate the attack we use our experimental setup (as described in Section 6) and the RvR approach. The experiment shows that there is an exploitable leakage, i.e., an amount of leakage sufficient, despite measurement noise, to allow recovery of intermediate values that depend on the secret key. For this experiment, we reduce the Picnic3 parameters (as in Fig. 6) M and τ to 4 and 2, respectively, in order to collect traces more quickly; however, we keep the number of parties N at 16 and collect traces corresponding to the execution of the first MPC instance. During the collection phase, we run the Picnic3 signing function with random messages and a fixed secret key. More specifically, we measure the execution of the first line of Algorithm 18, and collect 22,056 side-channel traces. Note that the root seed is also random due to the choice of a random message to be signed. We first select the traces belonging to signatures that reveal the first preprocessing phase, since our measurement covers the first MPC instance. The reduced number of MPC instances only serves to reduce the number of possible challenges and to increase the number of traces per challenge. Then we classify the remaining traces into two sets according to the revealed values λ^{w,(k)}_1. The result of the analysis in Fig. 1 (left side) shows a clear dependence between the unrevealed value ŵ^(k) and the observable trace, as the |t|-value clearly exceeds 5.7, which indicates an exploitable leakage. As seen in the right-hand side of Fig. 1, the leakage becomes clear after 2,725 traces.
The code we measure (the first line of Algorithm 18) corresponds to the calculation of roundkey_0; thus the leakage corresponds to the bits of roundkey_0, which is equal to matMul(ŝk, K_0). Solving the equation for the key, sk = ŝk − (λ^{sk}_1 + · · · + λ^{sk}_{16}), where λ^{sk}_i is known for all i and K_0 is a constant, leads to the secret value sk.
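The final key-recovery arithmetic can be illustrated with toy values (hypothetical bit-vector code; over F_2 the subtraction is simply XOR):

```python
def recover_key(sk_hat: int, lam_shares: list) -> int:
    """Attack step: sk = sk_hat - (lam_1 + ... + lam_16), which over
    F_2 means XORing the probed sk_hat with all revealed mask shares."""
    sk = sk_hat
    for lam in lam_shares:
        sk ^= lam
    return sk
```

The point of the attack is that ŝk is learned via probing while every λ^{sk}_i is public in the signature, so this computation is trivial for the adversary.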

Probing the Unopened Party
The second attack uses the values revealed for the online phase, i.e., ŵ^(k) and λ^w_i for i ≠ i_k. This attack is a straightforward variant of the one by Gellersen et al. [GSE20] and Seker et al. [SBWE20], but adapted to work with Picnic3. In contrast to the attack described above, we now target an MPC execution whose online phase is selected to be part of the signature (i.e., k ∈ C). In that case there is a single party P_{i_k} whose internal state must remain hidden for the privacy of the MPC protocol to hold. By design, the values ŵ^(k) and λ^w_i for i ≠ i_k are revealed during verification. Thus the measurements have a weak and noisy dependence on the value λ^w_{i_k}, which can be exploited together with the revealed values. We validate the attack using the same experimental setup and parameters as in Section 3.1. During the collection phase, we again run Picnic3 signing with random messages and a fixed secret key, and measure the execution of the preprocessing phase (Algorithm 15), where roundkey_0 is computed. Since ŝk = sk + λ^{sk}, we can solve for λ^{sk}_{i_k}. We observe that roundkey_0 can be probed (over multiple traces) since sk is a constant value. Then λ^{sk}_{i_k} can be calculated as roundkey_0 − Σ_{i≠i_k} λ^{sk}_i, and used to obtain the secret (as described above). The result of the analysis in Fig. 2 shows a clear dependence between the unrevealed value roundkey_0 and the observable trace, as the |t|-value clearly exceeds 5.7, which indicates an exploitable leakage.

Masking Three-Round KKW
In this section we present our masked proof system following the three-round KKW protocol. We first strive for a provably NI-secure algorithm without specifying any particular circuit. In Section 5 we describe more concrete operations tailored to the LowMC circuit of Picnic3, and optimize by partially unmasking several hash computations (and discuss the security implications). The circuit in this presentation is a generic circuit C such that C((w) w∈IN ) = 1, where each w is seen as an input wire value to the circuit.
In the description below, we additively secret share some variables in two dimensions: a share held by each party (indexed by i ∈ [N]), further shared T times within each party (indexed by j ∈ [T]). For example, λ^x_i denotes a share of λ^x held by the i-th party, such that λ^x = λ^x_1 + · · · + λ^x_N. The shares λ^x_i are as in the KKW protocol, and the extra T-wise encoding of this value is required for SNI security. Note that we only apply the encoding notation ⟨·⟩ (see Section 2.4) to the T-wise encoding required for the masking countermeasure, and never to the sharing of the KKW protocol. The functions requiring T-th order masked computation are marked in orange (e.g., ⟨y⟩ ← H(⟨x⟩) indicates that a hash function H is masked). Most of the randomness used in the protocol comes from the random tapes of the parties; this randomness is derived from a seed, so that part of it may be efficiently communicated to the verifier (by sending the seed). Some of the masked operations will require additional randomness (e.g., to refresh a secret encoding), and this is sampled from the platform random number generator (RNG), since it is only required by the signer.

Masked Operations
We present our masked version of the KKW prover in Fig. 3. The function masked offline in Algorithm 1 computes the offline phase Π^off_C from Section 2.1 in a straightforward way. The function masked online in Algorithm 2 corresponds to a masked version of Π^on_C. The SNI-secure multiplier SMul(⟨·⟩) is defined in Algorithm 11. Note that SMul(⟨·⟩) is t-SNI with uniform output distribution, as evident from the proof that it is t-SNI [BBD + 16, Proposition 2]. For ADD gates, the only change is to work on T-encodings ⟨x̂⟩, ⟨ŷ⟩, rather than on x̂, ŷ directly. Interestingly, the MUL gates can also be computed with a straightforward adaptation, by also encoding the masks λ^x, λ^y, λ^z, and λ^{xy}. Recall that x̂ = x + λ^x, and λ^x is shared among the parties, where party i has share λ^x_i (resp. λ^y_i, λ^z_i, and λ^{xy}_i). Each party thus stores its share λ^x_i as a T-encoding ⟨λ^x_i⟩ (resp. ⟨λ^y_i⟩, ⟨λ^z_i⟩, and ⟨λ^{xy}_i⟩). Each party's broadcast value s_i will now also consist of the T-encoding ⟨s_i⟩, obtained by computing the formula for s_i share-wise on the T-encodings.
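SMul itself is specified in Algorithm 11; for intuition, a classic ISW-style multiplication of boolean T-encodings (the construction of [ISW03], proven t-SNI in [BBD + 16]) can be sketched as follows. This is an illustrative stand-in, not necessarily the paper's exact Algorithm 11:

```python
import secrets

def isw_mul(a: list, b: list, width: int = 32) -> list:
    """Multiply (bitwise AND) two T-share boolean encodings.
    Fresh randomness r masks every cross term a_i & b_j, which is
    what gives the gadget its strong non-interference property."""
    t = len(a)
    c = [a[i] & b[i] for i in range(t)]
    for i in range(t):
        for j in range(i + 1, t):
            r = secrets.randbits(width)
            c[i] ^= r
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c
```

XORing all output shares cancels every r, leaving exactly (⊕ a_i) & (⊕ b_j).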

Security Analysis
We employ the definition of non-interference from [BBD+16], which guarantees security against t probes for t < T, as proposed by Ishai et al. [ISW03]. Recall that a probing adversary may invoke a cryptographic implementation multiple times with chosen inputs and, before each call, can fix an arbitrary set of up to t wires of the circuit and observe their values during the invocation. We use the refined security notions known as t-non-interference (t-NI) and t-strong non-interference (t-SNI), defined in Appendix F.
In the security analysis we focus on a single MUL operation as extracted from Algorithms 1 and 2, denoted KKW MUL and presented as Algorithms 22 and 23, since the ADD operation is linear and thus trivially NI. See Appendix G for the proofs of the lemmas.

Algorithm Masked KKW Prover
Inputs The prover holds a circuit C as a statement and an encoded witness ⟨ŵ⟩; the remaining values are parameters of the protocol.
Commit For each k ∈ [M], the prover does:
1. Choose a uniform seed^(k) and use it to generate the per-party values seed_i^(k).
2. Commit to the offline phase:

Compute encodings of masked witness ŵ
is the randomness used to mask the witness and is read from the random tape defined by state.
Response For each k ∈ [M] \ C, the prover sends the reconstructed seed^(k) and com_on^(k) to the verifier. For each k ∈ C, the prover sends the reconstructed values com_i^(k) to the verifier.

Algorithm 1 masked offline
Input: (⟨seed_i⟩)_{i∈[N]}. Output: ⟨aux⟩.
1: for each input wire w of the circuit
2:   read ⟨λ^w⟩_{i,j} from seed_{i,j}
3: for each gate in C with input wires x and y, and output wire z:
4:   if ADD then
5:   ... update ⟨aux⟩_j with ⟨λ^{xy}⟩_{N,j}
15: return ⟨aux⟩.

Algorithm 2 masked online
Input: Circuit C, ⟨ŵ⟩ for each input wire w of the circuit, and (⟨state_i⟩)_{i∈[N]}. Output: (⟨msgs_i⟩)_{i∈[N]}
1: for each gate in C with input wires x and y, and output wire z:
2:   if ADD then
3:     compute ⟨ẑ⟩ ← ⟨x̂⟩ + ⟨ŷ⟩
4:   if MUL then
5:     ... compute ⟨ẑ⟩ ← ⟨ĉ⟩ + ⟨ŝ⟩
14: for each output wire z of the circuit:
15: ...

Lemma 1. Let G be the KKW MUL gadget as described in Algorithm 22. Then G is t-SNI for all t < T, if SMul() is t-SNI with uniform output distribution.
Lemma 2. Let G be the KKW MUL gadget as described in Algorithm 23. Then G is t-SNI for all t < T.
To further support our security analysis, we utilized maskVerif [BBC+19] to confirm that both the KKW MUL offline and online gadgets are SNI-secure.
As we have shown that all components of Algorithm 1 and Algorithm 2 are SNI, the composability guaranteed by Lemma 3, as well as its generalization to multi-input/output SNI gadgets [CS20], implies the following theorem once suitable refresh gadgets are added (depending on the topology of the circuit C that KKW is instantiated with). Note that SNI security of the refresh gadget (recalled in Algorithm 12) is already proved by Barthe et al. [BBD+16].
Theorem 1. The masked KKW prover of Fig. 3 is t-NIo for all t < T and for public outputs {com ...}.
The public outputs stated above are not part of the unprotected KKW proof elements. We thus have to validate the security of the proof system in case these values are made public, which, however, is straightforward, since they are indeed non-sensitive information that can be computed from the response outputs (see the verification step of Fig. 6). Furthermore, note that masking the hash function H() is important. There are scenarios where the inputs to H() are sensitive, but the outputs are not (such as step 5 in Fig. 3). The computation of an unmasked hash function might thus leak information about these sensitive inputs.
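The share refresh used for composition has a simple ISW-style structure. The sketch below is an illustration, not the paper's Algorithm 12 verbatim: a fresh random bit is XORed into every pair of shares, which rerandomizes the encoding without changing the encoded value.

```python
import secrets
from functools import reduce
from operator import xor

def refresh(shares):
    """ISW-style share refresh over GF(2): XOR a fresh random bit into
    every pair of shares. The encoded value is unchanged, but all shares
    are rerandomized (sketch of an SNI refresh gadget)."""
    x = list(shares)
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            r = secrets.randbits(1)
            x[i] ^= r
            x[j] ^= r
    return x

shares = [1, 0, 1, 1]                  # a 4-encoding of the bit 1
fresh = refresh(shares)
assert reduce(xor, fresh) == reduce(xor, shares)   # value preserved
```

The pairwise cross-terms are what distinguish an SNI refresh from the cheaper (but only NI) refresh that adds an encoding of zero with T − 1 random bits.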

Masking Picnic
We start by analyzing the hashing operations in signature generation to determine which ones must be masked, then discuss the options for masking SHA3/SHAKE, introduce the half-masking technique, and finally estimate the overhead of masking the hash invocations. Our masked implementation of the Picnic3 signature generation function is a rather direct adaptation of the masked KKW proof protocol from Section 4. When compared to Section 4, the circuit is LowMC, and operations are done on N-bit words packed with a secret share from each of the N parties. Fig. 7 (in Appendix B) gives an overview of the protections for each hashing operation in signature generation. A complete specification of our protected implementation, mirroring the official Picnic specification [Pic20], is given in Appendix B.

Implementation Security
We implemented two versions of masked Picnic3: (1) a provably NIo-secure implementation (as a direct consequence of Theorem 1) and (2) a performance-oriented implementation with partially unmasked non-sensitive intermediate values. For the former, we inserted the share refresh gadget (RefreshM) according to the generic composition rule stated in Lemma 3, and we verified with maskVerif that our complete specification of fully masked Picnic3 (see Appendix B.1) is indeed NIo-secure. On the other hand, we do not claim that our second implementation is NIo-secure, as there are some gaps between the analysis of Section 4 and this implementation that we consciously allowed in order to improve performance. We now explain these gaps and argue that they do not impact practical security, and in Section 6 we confirm the absence of leakage experimentally.
First, according to Algorithm 4, all intermediate seeds would have to remain T-encoded until they are reconstructed at lines 29 and 32; however, as we argue in Section 5.2, reading t bits of a seed reduces security by at most t bits. As we assume t to be small, we accepted this risk to reduce the cost of masking SHAKE and the memory required to store seeds.
By selectively masking the hash function calls (as described in Section 5.2), as opposed to masking all hash function calls, up to t bits of a single-use seed may leak to a side-channel attacker capable of accurately reading t bits from a single trace. (Since the seed is only ever used once, and the signature is randomized, subsequent traces have a fresh seed.) Against such an attack, the security of our L1 implementation decreases from 128 to 112 bits. As shown in Sections 5.4 and 6, this optimization gives significant performance gains, so we see this as a reasonable trade-off. Recent work by Kannwischer et al. [KPP20] describes single-trace attacks on the unprotected XKCP Keccak implementation. These attacks use a single trace recorded during the computation of y = SHAKE(sk||x) and aim to recover all of a secret key sk, or part of y. While single-trace attacks could threaten some of the unprotected hash calls in our optimized implementation (e.g., when deriving the per-party or per-MPC-instance seeds), the results of [KPP20] do not extend to the M4, nor to the length constraints on sk, x, and y in our application. Future work may improve single-trace attacks, and in that case the conclusion of [KPP20] is that lightweight countermeasures will provide effective mitigation.
Half-masking (discussed in Section 5.3) also introduces the assumption that KangarooTwelve [VWA + 21] is a secure hash function. This assumption is only for security against the type of t-probing side-channel attack we consider, and half-masking can be used by individual implementations without changes to the Picnic specification. We provide benchmarks in Section 6 showing the performance advantage of half-masking: based on this and the fact that KangarooTwelve appears to be a relatively mild assumption, we enable half-masking by default in our implementation.
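The round-splitting idea behind half-masking can be illustrated on a toy permutation. The sketch below uses a *linear* round function standing in for Keccak-f so that shares can be processed independently (the real χ step would additionally need masked ANDs, as in Section 5.3): for the sensitive-input case, we mask the first half of the rounds, unmask at the midpoint, and finish in the clear.

```python
import secrets

M64 = (1 << 64) - 1

def toy_round(s, rc):
    """Toy *linear* round (rotate, then XOR a round constant), standing in
    for a Keccak-f round so shares can be processed independently."""
    return (((s << 13) | (s >> 51)) & M64) ^ rc

def perm(s, rounds):
    for rc in range(rounds):
        s = toy_round(s, rc)
    return s

def half_masked_perm(s, total=24):
    """Sensitive-input case: run the first 12 rounds on a 2-encoding of
    the state, then unmask and finish the last 12 rounds unmasked."""
    r = secrets.randbits(64)
    a, b = s ^ r, r                      # 2-encode the input
    for rc in range(total // 2):
        a = toy_round(a, rc)             # round constant goes into one share
        b = toy_round(b, 0)              # other share sees only the linear part
    s = a ^ b                            # unmask at the midpoint
    for rc in range(total // 2, total):
        s = toy_round(s, rc)
    return s

x = secrets.randbits(64)
assert half_masked_perm(x) == perm(x, 24)
```

An attacker probing the unmasked second half effectively sees a "12-round digest" of the input, which is what the K12 assumption is invoked for.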
Finally, our provable security analysis assumed an SNI-secure hash implementation. Although one could use the fully SNI-secure masked Keccak as suggested by Barthe et al. [BBD + 16], other previous works [BDPA10, Dae17, GSM17] achieved more efficient implementations with smaller amounts of random bits, albeit without a provable security guarantee. We implement three instances of masked Keccak (named IND, DOM, and SNI) with different security levels, which we explain in detail in Section 5.3. In Section 6, we compare the concrete performance and perform practical leakage analysis. From these experiments, we conclude that our implementation of IND (the fastest of the three instances) does not leak information, which provides some assurance.

Side-Channel Protections for Hashing in Picnic
In this section, and in Algorithm 3, we give a more detailed description of the parts of Picnic3 relevant to this paper.
Parameters. M is the number of MPC instances, N is the number of parties, τ is the number of revealed online executions, and κ is the security parameter (e.g., κ = 128 for security level L1). The circuit C, defined over the binary field F = {0, 1}, is also part of the public parameters. Concretely, the circuit computes Enc(w, p), where Enc is the LowMC block cipher with κ-bit key and block size, w is a κ-bit input witness (a LowMC secret key), and p and c are the plaintext and ciphertext, both κ bits long. If the input to C is a block cipher key that maps p to c, the circuit outputs 1.

Key Generation.
In the presentation below, the key pair is (pk, sk) = ((c, p), w), where both p and w are random κ-bit strings, and then c is computed as c = Enc(w, p).
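A minimal sketch of this key generation follows, with a hash-based stand-in for the LowMC encryption (`toy_enc` is our placeholder, not the real cipher; LowMC itself is out of scope for a short example):

```python
import hashlib
import secrets

KAPPA = 128  # security parameter in bits (level L1)

def toy_enc(w, p):
    """Placeholder for Enc(w, p), the LowMC block cipher with kappa-bit key
    and block size; modeled here with SHAKE so the sketch is runnable."""
    return hashlib.shake_128(b"enc" + w + p).digest(KAPPA // 8)

def keygen():
    w = secrets.token_bytes(KAPPA // 8)  # witness w: a random LowMC key
    p = secrets.token_bytes(KAPPA // 8)  # random plaintext
    c = toy_enc(w, p)                    # c = Enc(w, p)
    return (c, p), w                     # pk = (c, p), sk = w

pk, sk = keygen()
c, p = pk
assert toy_enc(sk, p) == c  # the relation the circuit C proves knowledge of
```

The signature then proves knowledge of a w with Enc(w, p) = c, without revealing w.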

Hashing Operations for Signing
The concrete sign operations are described in Algorithm 3, following the Picnic specification [Pic20, Section 7.1]. When compared to the stylized description of KKW in Fig. 6, here we include more details and list all hashing operations, since we will analyze them with respect to the probing attacks below. Some of the functions related to expanding seeds using a tree construction or creating a Merkle tree of commitments (gen seed, get leaves, build tree, and open tree) are left to the specification for simplicity.
The hash function calls are denoted by H and we omit the byte used for domain separation present in the specification. The KDF expands an arbitrary length input to an arbitrary length output. Both H and KDF are instantiated with the SHAKE XOF.
We now consider which hashing operations must be protected against side-channel attacks, and to what degree. The Picnic specification supports randomized signatures (and recommends this option, following [AOTZ20]) by appending a random value to the KDF input when deriving the root seed. We assume this option is used throughout, as otherwise the cost of side-channel protections would be significantly higher, since all hash function calls would require masking (as opposed to only 35% shown below), and all random seeds would need to be T -encoded. First, we note that all inputs to the challenge computation are public, so this hash does not need to be masked. We now analyze the other hash function calls in order. Fig. 7 (in Appendix B) gives an overview of the operations.
Deriving the root seed. For step 2, the SHAKE XOF is used as a KDF to derive a root seed for signature generation. The input sk must be protected from side-channel attacks. As a first option, one could choose the root seed at random, and avoid the KDF altogether. However, deriving the root seed from the secret key and random data hedges against failures in the RNG; see the analysis for Picnic in [AOTZ20]. If sk is stored T-encoded, then we can hash all of the shares in place of sk, and append a random value. Our implementation masks this hash function call since it is relatively cheap in the context of a signature, and it makes testing easier because our implementation can produce signatures that match known test vectors.

Algorithm 3 Description of Picnic signing, highlighting hashing operations.
Input: signer's key pair sk = ⟨w⟩ = (⟨w⟩)_{w∈IN}, pk, message to be signed Msg.
1: // Derive root seed: Sample random R ∈ {0, 1}^{2κ}, (seed*, salt) ← KDF(sk||Msg||pk||κ||R)
2: iSeed_tree ← gen_seed(seed*, salt, M, 0) // Tree of initial seeds
3: // Initial seed for each MPC instance: (iSeed^(1), . . . , iSeed^(M)) ← get_leaves(iSeed_tree)
4: for each k ∈ [M]
5:   seed_tree^(k) ← gen_seed(iSeed^(k), salt, N, j) // Seeds for MPC instance k
6:   (seed_1, . . . , seed_N) ← get_leaves(seed_tree^(k)) // N per-party seeds
7:   For each i ∈ [N]: tapes ... (aux^(k), tapes ...
Deriving other seeds. When generating the seeds in steps 3, 4, 6, and 7, protecting against the limited type of leakage we consider in this work is not necessary, since seeds are unique per-signature and are always hashed before use. Suppose an attacker A can read t bits of a leaf or intermediate seed s. With overwhelming probability each seed is only ever used in one signature, so traces from multiple signing operations will not give more information about s.
There are three possible uses of s to consider. When s is a seed from a leaf of the tree, case 1 is that s is hidden and the attacker has a commitment to it (computed in steps 14 and 16), and case 2 is when s is used to seed KDF (in step 9), and A has some of the output bits. In case 3, s is a hidden intermediate seed, the attacker has one of the two child seeds, derived by hashing s.
We can model all three cases as the attacker having C = H(s) along with t bits of s, where H is a secure hash function. In practice H is the SHAKE XOF, which the existing analysis of Picnic already assumes is a random oracle. Then if A makes q queries to H, they recover the missing κ − t bits of s with probability not more than q/2^{κ−t}. This considers only a single seed and digest, which we can do since each input to H is unique, by construction (the Picnic spec uses a domain separation value, random salt, and counters to prevent multi-target attacks [DN19]). In practice κ ≥ 128 and t will be 16 or less (see Section 2.3), therefore the security of our implementation is still at least 112 bits.
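The bound can be checked on toy parameters: give the attacker t bits of a seed s plus H(s), and exhaustive search over the remaining κ − t bits succeeds only after on the order of 2^{κ−t} hash queries. Here κ = 16 and t = 4 so the search is feasible, and we assume for illustration that the leaked bits are the high-order ones:

```python
import hashlib
import secrets

kappa, t = 16, 4   # toy sizes; the paper has kappa >= 128 and t <= 16

s = secrets.randbits(kappa)
digest = hashlib.shake_128(s.to_bytes(2, "big")).digest(8)
leaked_high = s >> (kappa - t)          # attacker reads t bits of the seed

# Exhaust the remaining kappa - t bits: q queries succeed w.p. about q / 2^(kappa - t)
found = None
for guess in range(1 << (kappa - t)):
    cand = (leaked_high << (kappa - t)) | guess
    if hashlib.shake_128(cand.to_bytes(2, "big")).digest(8) == digest:
        found = cand
        break
assert found == s
```

At the paper's parameters (κ = 128, t ≤ 16) the same search requires about 2^112 queries, matching the stated 112-bit security floor.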
Computing random tapes. In step 9 we expand the per-party seeds to random tapes. The inputs do not need to be protected (as discussed in the previous paragraph), but all output bits must be protected, since some of the random tape bits will correspond to shares of the unopened party and must be kept secret, as shown in Section 3. We mask these calls, so the output is T-encoded (which increases the amount of memory required to store the tapes by a factor of T).
Computing commitments. In step 14, a commitment of the form H(seed||salt) is computed. Here the private input is a seed, which is not sensitive to leakage of up to t bits, as discussed above, and the output is public. Therefore, step 14 does not require masking.
In step 16, the last party's commitment has the additional input aux, which is sensitive to leaking of individual bits. We must mask this call, but since the output is public, we can use the half-masking technique of Section 5.3.
In step 18, we hash only public values, and no masking is required. In step 22, all inputs are sensitive to leaking individual bits (e.g., ŵ is sensitive due to the attack described in Section 3.1). Because the output is public, the half-masking technique is applicable.

Masking SHAKE
We implement multiple methods to protect the SHA3 family of functions against DPA attacks. In all of them, the Keccak-f state array A is secret shared into two arrays a, b, such that A = a + b. In the basic method proposed in [BDPA10], the linear operations are performed on the individual state arrays; the non-linear step (denoted χ) is then computed on terms from both shares, evaluated left-to-right. The cost of the linear operations is doubled, addition of constants has the same cost, and the cost of χ is doubled, plus two additional AND and XOR operations, so the computational cost of the masked round function is roughly doubled. One must also consider the cost of generating random values to create the secret shares. This method (herein called IND) only achieves independence from the native variables, and the same approach can be generalized to three or more shares. In Domain-Oriented Masking (DOM) [GSM17], the AND operations between shares a and b are further protected to satisfy SNI security with a random mask Z, as (a_{i+1} b_{i+2} + Z) and (b_{i+1} a_{i+2} + Z), respectively. However, this is still not sufficient for the masked Keccak as a whole to be SNI-secure: due to the θ layer, which applies a linear transform to the state array A, both inputs to the AND gadget in χ depend on the same previous state bit. This is a typical pattern of insecure composition observed in [BBD + 15b, §2.3]. Therefore, the third method (denoted by SNI) achieves SNI security by additionally refreshing shares of the state array A for every invocation of χ, as already suggested by Barthe et al. [BBD + 16].
Half-Masked SHAKE. When expanding a seed to a random tape, we have shown that security is maintained when leaking a small part of the seed (t bits or fewer), so the input is not sensitive to this bounded leakage, but the output is sensitive. Conversely, when creating a commitment, the individual input bits may be sensitive but the output is public.
An established assumption (for SHAKE128) is that security is preserved using only half the number of rounds, and there is a proposal called KangarooTwelve (K12) [VWA + 21] which uses 12 instead of 24 rounds. Therefore, for short inputs and outputs, one can view SHAKE as two calls to K12, and mask only one of the calls. In the case of sensitive inputs, we mask the first 12 rounds: an attacker who learns the state at the 13th round is effectively given a K12 digest of the input, which sufficiently hides the input under the assumption that K12 is a secure hash function. Similarly, when only the output bits are sensitive, we mask the last 12 rounds: any state bits observed by the adversary in round 11 do not leak useful information about the output, assuming that K12 is secure.
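Returning to the basic two-share (IND) method: its share-wise χ computation can be made concrete with the [BDPA10] formula, sketched below on a single row of five 64-bit lanes. Each output share mixes a lane from one share with a *different* lane of the other share, so no intermediate value combines both shares of the same lane:

```python
import secrets

MASK = (1 << 64) - 1

def chi_row(x):
    """Unmasked Keccak chi on one row of five 64-bit lanes:
    x[i] ^= ~x[i+1] & x[i+2] (indices mod 5)."""
    return [x[i] ^ ((~x[(i + 1) % 5] & MASK) & x[(i + 2) % 5]) for i in range(5)]

def masked_chi_row(a, b):
    """Two-share chi in the style of [BDPA10] (a sketch, not the paper's
    M4 assembly): a'[i] = a[i] + ~a[i+1]&a[i+2] + a[i+1]&b[i+2], and
    symmetrically for b, so the shares XOR to chi of a XOR b."""
    a2 = [a[i] ^ ((~a[(i+1) % 5] & MASK) & a[(i+2) % 5]) ^ (a[(i+1) % 5] & b[(i+2) % 5])
          for i in range(5)]
    b2 = [b[i] ^ ((~b[(i+1) % 5] & MASK) & b[(i+2) % 5]) ^ (b[(i+1) % 5] & a[(i+2) % 5])
          for i in range(5)]
    return a2, b2

x = [secrets.randbits(64) for _ in range(5)]
a = [secrets.randbits(64) for _ in range(5)]
b = [ai ^ xi for ai, xi in zip(a, x)]          # 2-encoding of the row x
a2, b2 = masked_chi_row(a, b)
assert [p ^ q for p, q in zip(a2, b2)] == chi_row(x)
```

The two extra AND/XOR terms per output share are exactly the "two additional AND and XOR operations" in the cost estimate above.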

Estimated Overhead of Hash Function Masking in Picnic
Here we provide a rough estimate of the overhead introduced by masking the SHAKE calls in Picnic3, which will have a high impact on the cost of signing since hashing is a large portion of the signing time (e.g., at L1 it is about 57% of the signing time on x64 [KZ20], and for our ARM M4 implementation it is about 71%).
- For seed tree hashing, we have about M + log₂ M hashes to compute the round seeds, and MN + M log₂ N hashes for the per-party seeds. None of these must be masked.
- For random tape expansion, we have MN hashes, all of which must be masked.
- For commitments, we have NM + 2M + log₂ M hashes and must mask 2M of these.
The total number of hashes is thus 3MN + 3M + 2 log₂ M + log₂ N, and MN + 2M of these must be masked. At L1, all hash operations involve one call to Keccak-f, so all calls have approximately the same cost. Again at L1, M = 250 and N = 16, so we find that about 35% of hashing must be masked. Since all masked hash operations have either a non-sensitive input or output, they need only be half-masked (as explained in Section 5.3). Now suppose we focus on first-order protection (the case T = 2), and assume that masked SHA-3 is about 2.73 times slower than unmasked SHA-3, and that half-masked SHA-3 is about 1.95 times slower (these are the ratios from our implementation described in Section 6). Then we expect a 1.61x increase in time spent hashing in masked Picnic3, and a 1.35x increase when half-masking is used.
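The arithmetic above can be reproduced directly. With these exact inputs the half-masked multiplier comes out near 1.33x, close to the roughly 1.35x estimate:

```python
import math

M, N = 250, 16                     # picnic3-L1 MPC parameters used above
total = 3 * M * N + 3 * M + 2 * math.log2(M) + math.log2(N)
masked = M * N + 2 * M             # tape expansion plus 2M commitment hashes

frac = masked / total
print(f"fraction masked: {frac:.0%}")   # -> fraction masked: 35%

slow_full, slow_half = 2.73, 1.95  # measured SHA-3 slowdowns (Section 6)
t_full = (1 - frac) + frac * slow_full  # expected hashing-time multiplier
t_half = (1 - frac) + frac * slow_half
print(f"{t_full:.2f}x full-masked, {t_half:.2f}x half-masked hashing")
```

The multipliers weight the masked fraction by its slowdown and leave the unmasked fraction at cost 1.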

Implementation and Experimental Evaluation
In this section we benchmark our implementation and discuss performance, then describe our experiments to ensure that our implementation is side-channel resistant in practice.

Implementation and Benchmarks
We implemented our masked version of Picnic signing and benchmarked it on the ARM Cortex M4, using the pqm4 [KRSS] suite and the STMicro developer board STM32F407G-DISC1. This board has 1MB of flash memory and 192KB of RAM, and comes equipped with a true random number generator implemented as a hardware peripheral. The microcontroller clock frequency ranges from 24 to 168MHz, so following standard practice our benchmarks were executed at the lowest frequency to avoid the impact of memory wait states [FA17,HL19].
Our implementation is derived from the Picnic optimized implementation, which is primarily optimized for x64 platforms, and is not well-optimized for the M4. As such, our implementation results aim to bound the overhead of masking countermeasures. We also focus only on first order protection, i.e., the case T = 2, and implement only the L1 parameter set picnic3-L1. As most of our countermeasures are general, we expect them to apply equally to more optimized M4 implementations of Picnic, and (with some effort) to implementations of other MPCitH-based proof systems.
Since our masked implementation produces picnic3-L1 signatures compatible with the specified version [Pic20], we do not repeat signature or key sizes in our benchmarks: public keys are 34 bytes, secret keys are 17 bytes, signatures are 12.4KB. All other parameters such as number of MPC parties, MPC instances, digest lengths, etc. are as specified in [Pic20]. For reference, the verification time in our implementation is 204M cycles.
In order to experimentally verify the absence of leakage, we make use of fixed-vs-random (FvR) tests.
Masked Keccak. We implemented the three different flavors of masked Keccak described in the last section: IND, DOM, and SNI. The implementation was built on top of the in-place 32-bit ARMv7-M assembly code found in the official Keccak code package (XKCP), modified to operate over a double-sized state storing the two shares. We implement the same Keccak API used in Picnic by replicating functions over each share of the state, and modifying the round function to implement the non-linear operations. Because of the larger state, additional pressure was put on the registers and several intermediate variables had to be spilled onto the stack. This caused some additional performance overhead beyond the raw cost of masking. In order to prevent leakage, we took additional care to rotate registers between rounds to prevent them from loading shares of the same variable [BDPA10].
Benchmarks. In Table 1 we give cycle counts for our masked implementation with various options for how the hash function calls are masked. The masking cost for the non-hashing operations is 156M cycles, which represents an overhead of 1.5x over baseline. Since this is effectively doing the MPC simulation with 2-encoded values, we might expect a factor-two slowdown rather than 1.5; however, this is explained by the fact that many of the operations to implement LowMC are ANDs and XORs with public constants, which are more efficient than operations on 2-encoded values. Then the cost when masking the hashing naively (by masking all operations) is given for the three Keccak masking options (SNI, DOM and IND), and we see that the overhead is 1203M, 829M and 396M cycles, respectively. By using our analysis of Section 5.2, and selectively masking only sensitive hash function calls, the overhead for IND drops to 153M cycles, and all the way down to 86M cycles when we additionally use the half-masking optimization. In this most performant case, we have roughly 2/3 of the overhead accruing to the Picnic and MPC operations, and 1/3 to the hashing.

The options for Picnic are "No" masking and T = 2 masking, as described in Section 5. For SHAKE, the "None" option indicates no masking is used for hash computations; the "All-" prefix means every call is masked in one of three possible ways: SNI-secure [BBD + 16], Domain-Oriented Masking (DOM) [GSM17], or using independent values (IND) [BDPA10]. "Selective" means that only sensitive calls are masked with independent values (as described in Section 5.2), and "Selective Half" means that in addition to selective masking, we use half-masked SHAKE. The Hashing column gives the fraction of the signing time spent computing SHAKE.
Stack usage was essentially constant for all configurations we benchmarked, since the total amount of memory required is dominated by storing the signature, the commitment and seed trees, and not by the storage space for intermediate values that we must T-encode. Code size increases by 1.2x in the most performant masked implementation (selective half-masked), and by 1.4x in the fully SNI-secure version. Finally, the randomness requirements range from the baseline of ≈ 2MB when Keccak is masked with the IND method, to the much higher 80MB for the DOM method and 158MB for the SNI method, as these methods require additional refreshing of nonlinear operations within Keccak. Using the selective half-masking option reduces the randomness requirements of the DOM and SNI options significantly, since the number of hash function calls is decreased and some calls are only half-masked. By our estimate in Section 5.4, this would reduce randomness usage by about 65% for the DOM and SNI options. In terms of the ≈ 2MB of randomness used for the non-hashing operations, these are partly due to the refresh operations within the LowMC implementation (Line 11); they are required for SNI security, and since they did not have a significant impact on run time, we did not investigate the option of removing them. The other significant randomness consumer in the masked LowMC implementation is the masked AND operation (Algorithm 11). Here, future work could experiment with an implementation that masks ANDs as in the IND method for Keccak, with the aim of reducing randomness and improving run time.

Experimental Leakage Analysis
To ensure our masked implementations of Keccak and Picnic are practically side-channel resistant, we performed measurements of the implementation to confirm the absence of leakage. Our measurement setup comprises the STMicro developer board STM32F407G-DISC1 also used for the performance benchmarks, operated at 168MHz. We measure EM emanations using a Langer LF-U 2.5 near field probe connected to a Langer PA 303 preamplifier [EmP]. The EM probe is placed over the C29 blocking cap at a distance of approx. 1 mm. Measurements are recorded using a Tektronix MSO 6. For the Keccak implementation, we sampled at 3.125 GS/s with a 12-bit resolution and 200MHz bandwidth. For the Picnic measurements, which are 2 orders of magnitude longer, we reduced the sampling rate to 625 MS/s in order to obtain feasible measurement times and storage sizes. Note that this still over-samples the board (168MHz) by a factor of 3.7, which is well above the minimal oversampling threshold of 2 from the Nyquist-Shannon sampling theorem.
Masked Keccak Leakage Evaluation. For Keccak we evaluate the IND method and follow the FvR approach to detect all possible first-order leakages. During the trace collection phase, a set of side-channel traces is collected by processing either a fixed input or a random input under the same conditions. The fixed or random choice for the input is made at random. After that, we calculate the means and standard deviations of the two side-channel trace sets separately. The t-test indicates whether the two distributions have the same mean, i.e., whether they are indistinguishable for a first-order SCA. We apply the customary threshold value of 5.7 for long traces, as suggested by [DZD + 18]. To show the sensitivity of the measurement setup for first-order leakages, we apply the test once to the correctly masked implementation and once to the same implementation with fixed masks. When masks are not chosen at random, the test must detect the resulting first-order leakage. The left hand side of Fig. 4 shows the evaluation results of the masked Keccak implementation with fixed masks based on 2,000 measurement traces. As expected, the leakage test indicates strong leakage, with |t| clearly above 5.7. When masks are chosen uniformly at random, the t-value remains below 5.7, as shown in the right hand side of Fig. 4, even if the number of measurement traces is increased to 1,000,000. We thus conclude that the masked Keccak implementation is secure and provides the expected resistance to first-order attacks.
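The FvR evaluation reduces to Welch's t-test on the two trace sets. A sketch on simulated trace samples (Gaussian noise, with an artificial mean shift standing in for first-order leakage in the fixed set):

```python
import math
import random

def welch_t(x, y):
    """Welch's t-statistic, the core of the fixed-vs-random (TVLA) test."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

rng = random.Random(1)
# Simulated samples at one trace point: the "leaky" fixed set has a mean shift
leaky_fixed  = [rng.gauss(0.10, 1.0) for _ in range(20000)]
random_set   = [rng.gauss(0.00, 1.0) for _ in range(20000)]
masked_fixed = [rng.gauss(0.00, 1.0) for _ in range(20000)]  # no shift

assert abs(welch_t(leaky_fixed, random_set)) > 5.7   # leakage detected
assert abs(welch_t(masked_fixed, random_set)) < 5.7  # indistinguishable
```

In the real evaluation this statistic is computed per sample point across the trace, and the 5.7 threshold is applied to the pointwise maximum of |t|.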
Masked Picnic Leakage Evaluation. In order to analyze the leakage for the whole first-order masked Picnic3 implementation, we follow a similar methodology as for Keccak and employ the FvR approach. We collect the traces starting at the beginning of signature generation until the end of the first MPC instance, i.e., including a single preprocessing phase and simulation of an online phase (Line 1 to Line 22 in Algorithm 4). Note that after this point, everything is public, and any leakage gives no additional information beyond what is made public in the signature. To analyze our signature implementation, we choose the FvR key scenario, under randomized messages, as proposed in [TG16] for asymmetric cryptosystems. In addition, we needed to add artificial wait cycles before accessing the board's hardware TRNG to ensure that we never wait varying amounts of time for it to become ready, as this would destroy the constant-time property required for the TVLA. Note that this change is only necessary for the test setup and not required in the production code.
The sets of side-channel traces are collected by signing a random message using either a fixed key or a random key. As shown in Fig. 5, the t-value remains below 6.1 using 100,000 traces, which indicates the absence of leakage. Moreover, the maximum |t|-value is bounded and shows a stable pattern. Recall that an exploitable leakage, as shown in Fig. 1 or Fig. 2, exceeds the threshold value within as few as 2,725 traces and has a clearly increasing pattern. Thus we can conclude that the first-order masked Picnic3 implementation provides the expected SCA resistance.
Scaling to higher security levels and masking orders. Our implementation and experimental evaluation are limited to security level L1 and masking order T = 2. Since we expect the proportion of time spent on hashing vs. MPC simulation to be similar at levels L1 and L5 (as was the case for x64, see [KZ20, Table 7]), we expect the overhead of our masking techniques to be similar at L5 as well. When T increases, we can only make rough predictions. We expect running time overhead to increase quadratically and memory overhead to increase linearly, due to the asymptotic behavior of masking nonlinear operations, and the additional storage required for T-encoded values.

Conclusion and Future Work
In this paper we studied the side-channel security of MPCitH proof protocols and related signature schemes. We found and demonstrated a new probing attack on the KKW proof protocol (as implemented by Picnic). We then showed that masking the signing operations is a practical countermeasure against side-channel attacks, and proved that our masked KKW and Picnic3 implementations meet the standard security notion (NIo), with a mix of manual proofs and formal verification with the maskVerif tool.
We implemented a masked version of the Picnic3 signature scheme for the ARM Cortex M4 as a case study, and found that the cost of masking (in terms of runtime) is high when we simply apply SNI-secure masking to all hashing operations. After careful analysis of the hashing operations, we found that the masking overhead can be quite reasonable (as low as 1.8x) under modest assumptions that we verified with practical leakage analysis of our implementation. With hardware support for side-channel protected hashing, our work shows that the overhead of masking the non-hashing parts of Picnic signing is about 1.5x, and our SNI analysis applies here.
Our flexible masked SHA-3 implementation is the first publicly available one, and will be useful to other projects as SHA-3 becomes more common. We also expect our half-masking optimization to find application in other implementations, as most hash operations have a non-sensitive input or output.
Performance improvements (while maintaining resistance to side-channel attacks) are an obvious direction for future work, both on the M4 and other embedded platforms. Reducing the amount of randomness consumed by our mitigations is also an interesting way to improve performance, together with generalizing to higher order protection efficiently.
Finally, an implementation that combines SCA resistance and resistance to fault attacks (perhaps leveraging the fault-resistance results for Picnic in [AOTZ20]) would also make a good follow-up work.

A Complete Description of the KKW Proof System
In Fig. 6 we present a three-round KKW proof system. We remark that commitments (i.e., generation of com_i^(k) and com_on^(k)) are de-randomized and replaced by a hash function as suggested in [KKW18, §3] (which loses HVZK but is still sufficient for provable security of the signature), and the protocol is mildly generalized to work with arithmetic circuits following [dDOS19, BN20].

B Our Protected Picnic3 Implementation
In this section we give a detailed description of our masked Picnic3 implementation. Algorithm 4 contains the top-level signature generation function, which calls the other algorithms in this section. Fig. 7 gives an overview of the optimized hashing operations mentioned in Section 5.2, indicating which optimizations are applied to each one.
Notation. T is the number of shares used by our implementation, and masked values are T-encoded. For LowMC, n is the block size, and the precomputed constants K_i, L_i, and R_i are as defined in Appendix E. The parameter N is the number of MPC parties, and M is the number of MPC instances.

Data Types and Helper functions.
- T-encoding: an additive secret sharing in GF(2). For a bit b, the T-encoding is a vector of T bits that XOR to b; for a bitstring s, the T-encoding is T bitstrings in GF(2)^|s| that XOR to s. As in other parts of the paper, we use b to indicate that b is T-encoded. • SMul: AND operation on two T-encoded values, computed with Algorithm 11. For example, with 2-encoded inputs a and b, this algorithm outputs c = a ⊗_T b as c_1 = a_1 b_1 + r and c_2 = a_2 b_2 + a_1 b_2 + a_2 b_1 + r, where r is a fresh random bit. Note that c = c_1 + c_2 = (a_1 + a_2)(b_1 + b_2) = ab.
• Additional functions: Two additional helper functions from the literature are described in Appendix C. These are for refreshing the randomness of a T -encoded value and decoding (or unmasking) a T -encoded value.
- matMul_T: This is a generalization of the matMul matrix multiplication function in [Pic20, §6.4.4], modified to work on T-encoded input vectors; the matrix remains unshared. The input is a T-encoded vector v of length n and an n × n matrix M, and the output is a T-encoding of the length-n vector vM. Since x ↦ xM is linear over GF(2), each share is multiplied by M independently; if T = 2, the output shares are v_1 M and v_2 M, and v_1 M + v_2 M = (v_1 + v_2)M = vM.
- tapes: an object representing the N random tapes, one per party. In case we need to be explicit about individual per-party tapes, it is parsed as tapes = tape_1 || . . . || tape_N. Each tape is expanded from a seed using a masked version of SHAKE that produces T-encoded outputs, and we store these T-encoded outputs. We also store a T-encoded representation of the aux tape, the N-th party's share.
tapes to word(tapes, offset): Read one bit from each of the N tapes at the index offset, and output an N -bit word. When the tapes are T -encoded, the output word is also T -encoded.
tapes to parity T (n, tapes, offset): Reads n bits from each tape at the index offset, obtaining the strings s_1, . . . , s_N, and returns a T-encoding of s_1 ⊕ · · · ⊕ s_N. Our implementation computes and stores a T-encoding of the parity tape (i.e., the XOR of all N tapes) and uses this to implement the tapes to parity function.
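To make these helper types concrete, the following minimal Python sketch (under the assumption T = 2; names such as encode2, smul2, and mat_mul_T are illustrative, not identifiers from the actual implementation) models a 2-encoding, the ISW-style SMul AND of Algorithm 11, and the share-wise matrix multiplication underlying matMul_T. The same share-wise principle also yields the T-encoded parity tape.

```python
import secrets

def encode2(v):
    """2-encoding of a bit-vector v (tuple of 0/1 bits): two shares that XOR to v."""
    r = tuple(secrets.randbits(1) for _ in v)
    return [r, tuple(a ^ b for a, b in zip(r, v))]

def decode(shares):
    """XOR all shares together to recover the plain bit-vector."""
    out = shares[0]
    for s in shares[1:]:
        out = tuple(a ^ b for a, b in zip(out, s))
    return out

def smul2(a, b):
    """ISW AND for T = 2 on single-bit shares: c1 ^ c2 = (a1^a2) & (b1^b2)."""
    r = secrets.randbits(1)  # fresh randomness per multiplication
    c1 = (a[0] & b[0]) ^ r
    c2 = (a[1] & b[1]) ^ (a[0] & b[1]) ^ (a[1] & b[0]) ^ r
    return (c1, c2)

def mat_mul(v, M):
    """Plain GF(2) vector-matrix product: out[j] = XOR_i (v[i] & M[i][j])."""
    out = []
    for j in range(len(M[0])):
        bit = 0
        for i, vi in enumerate(v):
            bit ^= vi & M[i][j]
        out.append(bit)
    return tuple(out)

def mat_mul_T(shares, M):
    """matMul_T sketch: apply the unshared matrix to each share, relying on
    the linearity of x -> xM over GF(2): (v1 ^ v2) M = v1 M ^ v2 M."""
    return [mat_mul(s, M) for s in shares]
```

Because vector-matrix multiplication is linear, mat_mul_T needs no fresh randomness, while the non-linear smul2 consumes one random bit per AND.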

B.1 Specification of Fully Masked Picnic3
Algorithm 4 specifies the masked Picnic3 signing operations without the half-masked hashing optimizations that we described in Section 5.2. The notation is as defined elsewhere in the paper and in the appendix on LowMC (Appendix E). All functions marked in orange modify T-encoded values during the computation.

[Fig. 6 (protocol box). Protocol KKW. Inputs: both prover and verifier receive a circuit C as the statement; the prover also holds a witness w such that C(w) = 1. The values M, N, τ are parameters of the protocol.]

Fig. 7. Summary of our masking protections and hashing optimizations from Section 5.2. The figure is fully expanded for one of the M MPC instances, and shows the signer's operations for the commit phase of the protocol (i.e., before the challenge is computed). Hash functions in green are half-masked with sensitive inputs, hash functions in blue are half-masked with sensitive outputs, functions in red are NI-secure gadgets, and white functions are unprotected. The secret key (witness) is denoted sk and Msg is the message to be signed. In the figure, we omit the hashing of (com_1, . . . , com_N) into com_off, since the inputs are public values that can be reconstructed from the signature, and therefore this hash computation is unmasked regardless of our optimization.
The Unmask function takes a T-encoded value and returns the non-encoded value by summing the shares after a refresh (see Algorithm 13). We verified with maskVerif that Algorithm 4 is indeed NI-secure up to second order (implying it is also NIo for any public outputs).
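The refresh-then-sum pattern behind Unmask can be sketched as follows (a sketch only, on plain bit shares; the implementation's SNI RefreshM gadget of Algorithm 13 and the real Unmask differ in detail, and the function names here are ours):

```python
import secrets

def refresh(shares):
    """Re-randomize an additive bit-encoding without changing its value:
    XOR a fresh random bit into each pair of shares (ISW-style refresh)."""
    shares = list(shares)
    T = len(shares)
    for i in range(T):
        for j in range(i + 1, T):
            r = secrets.randbits(1)
            shares[i] ^= r  # each r cancels out in the XOR of all shares
            shares[j] ^= r
    return shares

def unmask(shares):
    """Unmask sketch: refresh the encoding first, then XOR the shares."""
    out = 0
    for s in refresh(shares):
        out ^= s
    return out
```

Refreshing first ensures that the partial XOR sums computed while unmasking are independent of the shares used elsewhere in the computation.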

B.2 Simulation of the Offline Phase
Algorithms 5 to 7 describe the protected preprocessing phase. The description is very similar to the preprocessing phase in the specification [Pic20, Section 7.4]; however, the data types and helper functions differ: primarily, all variables are T-encoded. Algorithm 6 describes our masked version of the LowMC S-box used for preprocessing (called by Algorithm 5), which in turn calls Algorithm 7, the AND operation for the preprocessing phase. Also, our presentation assumes that Algorithm 5 is used for signature generation only, since verification can use an unprotected implementation.

B.3 Simulation of the Online Phase
We now describe how the online phase of the MPC simulation is masked. Algorithm 8 is the MPC simulation for the online phase, implementing the LowMC circuit. For each AND gate, each party i broadcasts a bit; these bits are written to msgs_i and are also T-encoded. In Algorithm 9 we describe the S-box implementation used in Algorithm 8. Finally, Algorithm 10 describes the online simulation of an individual AND gate. The broadcast values (written to msgs_i) and the output bit are also T-encoded. Recall that SMul is implemented with the ISW multiplier (Algorithm 11).
Note that we need to refresh the T-encoded st before each invocation of masked sbox online (line 6 of Algorithm 8). On the one hand, since every round of LowMC involves a linear transformation of st (lines 1 and 7), every bit of st depends on all n bits of the previous st, which corresponds to a problematic composition pattern mentioned in [BBD + 15b, Diagram 1]. On the other hand, all the other gadgets are in fact affine gadgets, which can be securely composed in an arbitrary fashion. Hence, inserting RefreshM as we do is necessary and sufficient for the entire construction to be provably NIo secure.
Storage of secret keys. We assume that the Picnic secret key (a bitstring of length n) is stored in a T-encoded representation. Picnic key pair generation may be modified to generate T-encoded secret keys, or an implementation may use regular key generation in a trusted environment (e.g., during device manufacture) and then encode the secret key. As this is not important for performance, our implementation takes the regular key and T-encodes it at the beginning of signing. The input to the MPC simulation is then the T-encoded value ŝk = λ_sk ⊕_T sk, where λ_sk is the T-encoded random mask output by the preprocessing phase.
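A minimal Python sketch of this step (function names are ours, not the implementation's; T defaults to 2 for illustration): the key is freshly T-encoded at the start of signing, and the masked MPC input is formed by a share-wise XOR of two encodings.

```python
import secrets

def t_encode(bits, T=2):
    """Freshly T-encode a bit-tuple, as done at the beginning of signing."""
    shares = [tuple(secrets.randbits(1) for _ in bits) for _ in range(T - 1)]
    last = list(bits)
    for s in shares:
        last = [a ^ b for a, b in zip(last, s)]
    return shares + [tuple(last)]

def t_decode(shares):
    """XOR all shares to recover the plain bit-tuple."""
    out = list(shares[0])
    for s in shares[1:]:
        out = [a ^ b for a, b in zip(out, s)]
    return tuple(out)

def masked_mpc_input(lam_sk, sk):
    """Share-wise XOR of two encodings: an encoding of lambda_sk XOR sk,
    the masked witness fed into the MPC simulation."""
    return [tuple(a ^ b for a, b in zip(l, s)) for l, s in zip(lam_sk, sk)]
```

Since XOR is linear, combining the two encodings share by share never exposes either plain value.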
Algorithm 5 masked offline (corresponds to compute_aux in Section 7.4 of [Pic20]) Input: The tapes object tapes. The signer's public key pk = (c, p). Output: The n-bit key mask λ_sk. The tapes object is updated inside masked sbox aux.

D Specification of Unprotected Picnic3
For completeness, we include the full specifications of Picnic3 signing adapted from [Pic20]. The notation is as defined elsewhere in the paper, and in the appendix on LowMC (Appendix E).
There are r round constants R_i and linear-layer matrices L_i, and r + 1 key matrices K_i. The matrices are invertible, and the Picnic3 implementation uses the inverses K_0^{-1} and L_i^{-1}. LowMC keys are sampled uniformly at random from F_2^n. LowMC encryption starts by adding the first round key to the plaintext, followed by r rounds. Each round key is generated by multiplying the key with the key matrix K_i. A single round of LowMC is composed of an S-box layer, a linear layer, addition with constants, and addition of the round key, as shown in Algorithm 21. The S-box layer applies the same 3-bit S-box to the first 3 · s bits of the state. The S-box is defined as S(a, b, c) = (a ⊕ bc, a ⊕ b ⊕ ac, a ⊕ b ⊕ c ⊕ ab). The other layers consist only of F_2-vector space arithmetic, which are all local operations in our MPC setting.
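The S-box layer follows directly from the definition above; the sketch below models it on plain (unmasked) bits, with names of our choosing, and lets one check that S is a permutation of the 3-bit inputs.

```python
def lowmc_sbox(a, b, c):
    """The 3-bit LowMC S-box: S(a,b,c) = (a+bc, a+b+ac, a+b+c+ab) over GF(2)."""
    return (a ^ (b & c), a ^ b ^ (a & c), a ^ b ^ c ^ (a & b))

def sbox_layer(state, s):
    """Apply the S-box to the first 3*s bits of the state (a tuple of bits);
    the remaining bits pass through unchanged."""
    out = list(state)
    for k in range(s):
        out[3 * k:3 * k + 3] = lowmc_sbox(*state[3 * k:3 * k + 3])
    return tuple(out)
```

In the masked implementation, the two AND gates per S-box become SMul calls, while every other operation in the round is linear and is applied share-wise.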
Algorithm 21 LowMC encryption. Parameters K i , L i and R i are as described in the text.
Require: plaintext p ∈ F_2^n and key k ∈ F_2^n

F Additional Preliminaries on Security Notions for Masking Countermeasures
In the following, we fix some finite field (F, 0, 1, +, −, ·). As explained above, we are working in the t-probing model, which allows an attacker to obtain the value of t variables per run of the primitive. The most common technique to mitigate side-channel attacks is to encode sensitive variables via an additive (or polynomial-based) secret sharing into T > t parts. We say that a vector (v_j)_{j∈[T]} ∈ F^T is a T-encoding of v := Σ_{j∈[T]} v_j. For readability, we often write v instead of (v_j)_{j∈[T]}. For a subset I ⊆ [T], let x_I = (x_i)_{i∈I}, and furthermore let Ī = [T] \ I. Variables are shared both to protect against side-channel attacks and as part of the MPC protocol. To distinguish between these situations, we call a sharing between parties in the MPC protocol a sharing, and an encoding when the goal is to protect against side-channel attacks.
Without loss of generality, we only give the security definitions for circuits that receive a single encoded input x and produce a single encoded output y. In the following, we use the terms circuit and gadget interchangeably. Consider a (possibly randomized) gadget G, which on input x produces a value y according to some probability distribution G_x. To ensure that the computation of G does not leak any information, we modify it into a gadget that takes a T-encoding of x as input and outputs a T-encoding of y whose decoding is distributed identically to G_x. Informally, we want to argue that the t probes made by an attacker do not reveal any information about the sensitive input x. Assume that the attacker probes the values v^(1), . . . , v^(t). To formalize this intuition, we consider a distribution ensemble {D_x}_{x∈F^T}. This ensemble is a probability distribution on v ∈ F^t, capturing the probed variables of the attacker. We say that {D_x}_x is perfectly simulatable from the indices I ⊆ [T] if there is a probabilistic algorithm S that, on input x_I, has output distribution exactly D_x, i.e., Pr[S(x_I) = v] = Pr[D_x = v] for all v ∈ F^t.

Example 1. The simplest example of this notion of simulatability concerns projections. For example, if the ensemble {D_x}_{x∈F^2} is defined on the value y = x_1, it can easily be simulated from I = {1}, as knowledge of x_1 is sufficient to simulate y.

Example 2. Consider the following distribution {D_x}_{x∈F^2} on the triple of values (y_1, y_2, t) with t = x_1 + r, y_1 = t · x_2 = (x_1 + r) · x_2, and y_2 = x_2. Here, r is uniformly sampled from F, independently of x. This distribution ensemble is perfectly simulatable from I = {2}: the value of t can simply be simulated by sampling a random element from F, without any knowledge of x_1. Then both y_2 and y_1 can be simulated using the knowledge of x_2 and the already-simulated value t.
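Example 2 can be checked by exact enumeration. The sketch below takes F = GF(2) (so · is AND and + is XOR; the function names are ours) and compares the real distribution of the probed triple (y_1, y_2, t) with the distribution produced by a simulator that knows only x_2:

```python
from collections import Counter

def real_dist(x1, x2):
    """Exact distribution of (y1, y2, t) over the internal coin r in GF(2)."""
    d = Counter()
    for r in (0, 1):
        t = x1 ^ r
        d[(t & x2, x2, t)] += 1
    return d

def sim_dist(x2):
    """Simulator from I = {2}: sample t uniformly, never looking at x1."""
    d = Counter()
    for t in (0, 1):
        d[(t & x2, x2, t)] += 1
    return d
```

Since the two counters coincide for every choice of (x_1, x_2), the probed triple reveals nothing about x_1.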

F.1 Non-interference, Strong Non-interference, and Public Outputs
The most basic security notion for a masking countermeasure is the t-privacy of a gadget G [ISW03]. Informally, this means that the information provided by t probes of outputs or intermediate variables can also be obtained by probing t input variables, as long as the inputs are an encoding of x. While the idea behind the notion is relatively simple, it is unfortunately not composable as the output of a t-private gadget is not necessarily a truly uniform encoding. The composition of two t-private gadgets is thus not necessarily t-private. In order to remove the requirement that the inputs have a certain distribution, the notion of non-interference was introduced [BBD + 15a].
Non-interference gets rid of the dependency on uniformly encoded inputs, but a more subtle issue still prevents a composability result. To give an intuitive overview of this problem, consider a gadget G with two sensitive inputs x and x′ and sensitive output y. Non-interference implies that for any I_y ⊆ [T], the values y_{I_y} can be simulated from x_{I_x} and from x′_{I_{x′}} for two sets I_x, I_{x′} of cardinality at most |I_y|. Now, if G is used in another circuit, it might be the case that x and x′ are correlated (or even identical). Then x_{I_x ∪ I_{x′}} might reveal information about x. See, e.g., [BBD + 16] for a more detailed explanation. Hence, an even stronger notion, called strong non-interference, was introduced [BBD + 16], which guarantees a clear separation between input variables and output variables: if no intermediate variable was probed, the output of the circuit is independent of its input (when observing at most t positions), and thus all subsets of at most t outputs are input-ignorant. We will occasionally talk about the concrete distribution of these input-ignorant variables. If all (Int, O, t)-input-ignorant output variables of a t-SNI gadget G are distributed according to a distribution D, we say that G is t-strongly-non-interfering (t-SNI) with output-distribution D.

Definition 2 (t-NI, t-SNI). Let G be a gadget with inputs in F^T and t < T. Suppose that for any set of t_1 probed intermediate variables and any set O of output indices with t_1 + |O| ≤ t, there is for each input x an input set I such that the probed intermediate variables and the output variables y_O can be perfectly simulated from x_I. If this holds with |I| ≤ t_1 + |O|, then G is t-non-interfering (t-NI); if it holds with |I| ≤ t_1, then G is t-strongly-non-interfering (t-SNI).
Finally, the SNI notion guarantees that the composition of two t-SNI gadgets is t-SNI again. In Appendix C we recall two t-SNI-secure gadgets, SMul and RefreshM, that we use as building blocks of our masked KKW proof system. For the sake of completeness, we repeat the corresponding proposition from [BBD + 16].

Lemma 3 (Proposition 4 of [BBD + 16]). Let C be a circuit built from gadgets G_1, . . . , G_r such that all G_i are t-NI, and all encodings are used at most once as input of a gadget call other than RefreshM. Then C is t-NI. Moreover, C is t-SNI if it is t-NI and all encodings corresponding to the outputs of C are refreshed through RefreshM before output.
The commonly used term probing-security can either mean privacy [BBC + 19] or non-interference [CGPZ16]. Classically, the non-interference notions only deal with gadgets where all of the inputs and outputs are sensitive. To also handle public, non-sensitive values, the notion of NI with public output (t-NIo) was proposed in [BBE + 18]. As mentioned in [BBE + 18, Lemma 1], if a gadget G is t-NI secure it is also t-NIo secure for any public outputs. Clearly, the same claim also holds for t-SNI and t-SNIo. While the KKW-protocol also contains public variables, we are able to show the stronger guarantee of t-(S)NI.

G Omitted Proofs
In this section we give formal security proofs for the SNI security of Algorithms 22 and 23.

G.1 Proof of Lemma 1
Proof. We need to show that for any set of t < T intermediate variables and any subset O ⊆ {ẑ_1, . . . , ẑ_T} of output shares such that t + |O| < T, for each input variable v there is an input set I_v with |I_v| ≤ t such that the t intermediate variables and the output shares of λ^xy_N indexed by O can be perfectly simulated from these input sets.
Both the computation of λ^x and λ^y are straightforward and can simply be simulated, as they are linear operations. Whenever one of the terms involved in the computation of λ^xy ← SMul(λ^x, λ^y) is probed, we add the corresponding values from the proof of the SNI security of SMul (found, e.g., in [BBD + 16, Proposition 2]) to the input sets I_v. The result λ^xy can be simulated without any input, as SMul is SNI. To simulate the output later on, we add all λ^xy_{i,j} to the input sets I_v. Finally, the computation of λ^xy_N is again linear. For the output, suppose that λ^xy_{N,j} was probed. There are two cases to consider: if λ^xy_j was probed, the input sets I_v already contain all λ^xy_{i,j} and we can thus simulate λ^xy_{N,j} perfectly. If λ^xy_j was not probed, λ^xy_j looks like a uniformly random element from F that is not used anywhere else, as it is produced by a t-SNI gadget with uniform output-distribution. We can thus uniformly sample a random element r ∈ F and replace λ^xy_{N,j} by r. This implies strong non-interference.
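The uniform-output claim for an unprobed SMul output share can be verified by exact enumeration in the T = 2 case (a sketch with names of our choosing; the two output shares follow the 2-share ISW formulas):

```python
from collections import Counter

def isw2_output_dists(a, b):
    """Marginal distributions of the two output shares of the 2-share ISW AND,
    enumerated over the internal coin r; a and b are 2-tuples of bit shares."""
    d1, d2 = Counter(), Counter()
    for r in (0, 1):
        d1[(a[0] & b[0]) ^ r] += 1
        d2[(a[1] & b[1]) ^ (a[0] & b[1]) ^ (a[1] & b[0]) ^ r] += 1
    return d1, d2
```

Each marginal equals the uniform distribution on GF(2) regardless of the inputs, which is exactly the property the simulation step relies on.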

G.2 Proof of Lemma 2
Proof. We thus need to show that for any set of t < T intermediate variables and any subset O of output shares such that t + |O| < T , for each input variable v, there is an input set I v with |I v | ≤ t such that the t intermediate variables and the output variables indexed by O can be perfectly simulated from these input sets. To show this, we go through all variables of the algorithm and explain for all input variables v which indices are added to I v .
Whenever one of the terms involved in the SMul computation of a term a_{i,j}, b_{i,j}, or c_j is probed, we add the corresponding values from the proof of the strong non-interference of SMul to the input sets I_v. Note that no inputs need to be added to I_v if a_{i,j}, b_{i,j}, or c_j itself was probed, as these are the results of a t-SNI gadget.

Whenever s_{i,j} or a sub-term of s_{i,j} is probed, we add the variables corresponding to a_{i,j}, b_{i,j}, λ^z_{i,j}, and λ^xy_{i,j} to the input sets I_v. This clearly allows us to simulate all s_{i,j} and all sub-terms perfectly. Whenever a partial sum Σ_{i=1}^{i′} s_{i,j} (including s_j itself) is probed, we distinguish two cases. If s_{1,j}, s_{2,j}, . . . , s_{i′,j} were all probed, we can simply simulate the complete sum. Otherwise, there is a term s_{i″,j} with i″ ∈ {1, . . . , i′} such that s_{i″,j} was not probed. As s_{i″,j} is the only place where a_{i″,j} is used, we make use of the fact that a_{i″,j} is constructed by a t-SNI gadget with uniform output-distribution. In other words, a_{i″,j} looks like a uniformly random element from F that is not used anywhere else. We can thus uniformly sample a random element r ∈ F and replace the complete sum Σ_{i=1}^{i′} s_{i,j} by r. Note that in this argument, we did not add anything to I_v.

Finally, whenever ẑ_j is probed, we simply simulate s_j and c_j. As c_j is the result of a t-SNI gadget, we can simulate it without adding anything to the input sets I_v. As shown in the discussion about s_j, we can also simulate s_j without adding anything to the input sets I_v. This implies strong non-interference.